2605.27846v1 May 27, 2026 cs.AI

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Yujing Wang
Yuwei Miao
Gen Li
Yunsheng Zeng
Siyu Chen
Yu Qiao
Jianwei Lv
Bo Yuan
Xiandong Li
Junfeng Wang
Luning Wang

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches assign fixed weights to positive and negative samples, and their findings do not readily generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficient of positive samples from the ratio of the current policy entropy to the initial entropy. While entropy is decreasing, the weight assigned to positive samples is reduced to preserve exploration; while entropy is increasing, it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
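
The two mechanisms named in the abstract, the reward-mean split and the entropy-ratio weight on positive samples, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the function names, the clipped linear mapping from the entropy ratio to the weight, the bounds `w_min` and `w_max`, the fixed weight of 1 on negative samples, and the choice to treat a reward equal to the mean as negative. It only reproduces the stated qualitative behavior, in which the positive-sample weight falls as entropy drops below its initial value and rises as entropy grows.

```python
from statistics import mean


def split_by_reward_mean(rewards):
    """Return True for responses whose reward is above the group mean.

    Assumption: ties with the mean count as negative; the abstract does not say.
    """
    mu = mean(rewards)
    return [r > mu for r in rewards]


def positive_weight(current_entropy, initial_entropy, w_min=0.5, w_max=2.0):
    """Hypothetical weight for positive samples from the entropy ratio.

    Assumption: the weight is the ratio H_t / H_0 clipped to [w_min, w_max].
    The paper's actual mapping may differ; this only keeps the weight
    increasing in the ratio, as the abstract describes.
    """
    if initial_entropy <= 0:
        raise ValueError("initial_entropy must be positive")
    ratio = current_entropy / initial_entropy
    return min(max(ratio, w_min), w_max)


def weighted_advantages(rewards, current_entropy, initial_entropy):
    """Mean-centered advantages with positive ones rescaled by the adaptive weight.

    Assumption: negative samples keep a fixed weight of 1.
    """
    mu = mean(rewards)
    w_pos = positive_weight(current_entropy, initial_entropy)
    advantages = []
    for r in rewards:
        adv = r - mu
        advantages.append(w_pos * adv if adv > 0 else adv)
    return advantages


if __name__ == "__main__":
    group_rewards = [1.0, 3.0, 2.0, 2.0]
    print(split_by_reward_mean(group_rewards))
    # Entropy has halved since the start of training, so positives are down-weighted.
    print(weighted_advantages(group_rewards, current_entropy=0.5, initial_entropy=1.0))
```

Clipping the ratio is one simple way to keep the weight bounded if entropy spikes or collapses; any other monotone mapping of the ratio would fit the abstract's description equally well.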
