2604.16881v1 Apr 18, 2026 cs.CL

Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation

Longyue Wang (citations: 149, h-index: 6)
Weihua Luo (citations: 738, h-index: 13)
Hao Wang (citations: 3, h-index: 1)
Jiang Zhou (citations: 10, h-index: 2)
Xinwei Wu (citations: 3, h-index: 1)
Linlong Xu (citations: 17, h-index: 2)
Xiaohu Zhao (citations: 21, h-index: 3)
Tianyu Dong (citations: 15, h-index: 3)
Hengyu Liu (citations: 433, h-index: 5)
Deyi Xiong (citations: 185, h-index: 7)
Yangyang Liu (citations: 9, h-index: 1)

Abstract

Cross-cultural entity translation remains challenging for large language models (LLMs), which often produce literal or phonetic renderings instead of culturally appropriate translations in context. However, the relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of this parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on only 7k samples raises Qwen3-14B's entity translation accuracy from 23.66% to 31.87% on a 50k test set composed entirely of unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24++, which rises to +1.59 with extended optimization. Extensive analyses of pass@k dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
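The abstract names two mechanisms without giving their exact form: a verifiable, entity-level reward guarded by structural gates, and pass@k as an analysis metric. The Python sketch below is only an illustration of what such pieces could look like, not the authors' implementation. The function names, the single-line format gate, and the case-insensitive substring match are all assumptions introduced here. The pass@k function uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k), which the paper may or may not use.

```python
from math import comb


def entity_anchored_reward(output: str, accepted_entities: list[str]) -> float:
    """Hypothetical binary reward: 1.0 only if the output passes a format
    gate and contains one of the accepted target-language entity names."""
    text = output.strip()
    # Structural gate (an assumption, not from the paper): require a single
    # non-empty line so malformed or rambling outputs earn no reward.
    if not text or "\n" in text:
        return 0.0
    lowered = text.casefold()
    # Verifiable entity check (assumed form): case-insensitive substring match.
    hit = any(e.casefold() in lowered for e in accepted_entities if e)
    return 1.0 if hit else 0.0


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative example only: the German title of "The Lord of the Rings".
    gold = ["Der Herr der Ringe"]
    print(entity_anchored_reward("Der Herr der Ringe ist ein Roman.", gold))
    print(entity_anchored_reward("The Lord of the Rings ist ein Roman.", gold))
    print(pass_at_k(n=10, c=3, k=1))
```

A binary, string-checkable reward of this kind can be computed without an external knowledge base at training time, which matches the framing in the abstract. How the actual gates and matching rules are defined is left to the paper itself.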

Citations: 1 | Influential: 0 | Altmetric: 6.5 | Score: 33.5
