2603.16157v1 Mar 17, 2026 cs.LG

DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

Long Li (Citations: 81; h-index: 5)
Zhijian Zhou (Citations: 48; h-index: 4)
Zhe Wang (Citations: 82; h-index: 5)
Shirui Pan (Citations: 35; h-index: 2)
Chao Qu (Citations: 29; h-index: 4)
Zuming Huang (Citations: 311; h-index: 3)
Wei Chu (Citations: 322; h-index: 3)
Yuan Qi (Citations: 721; h-index: 11)
Tianyi Wang (Citations: 31; h-index: 3)
Weidi Xu (Citations: 864; h-index: 13)

Abstract

While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
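
The two components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `FifoReplayBuffer`, its `resize` method, and the weight `beta` are hypothetical names introduced here, and the abstract does not specify the adaptive sizing rule or how the reference distribution is built from buffered trajectories. The sketch only shows a first-in-first-out buffer whose capacity can change, and the standard Jensen-Shannon divergence between two discrete distributions used as a penalty term.

```python
import math
from collections import deque


class FifoReplayBuffer:
    """Hypothetical FIFO buffer that keeps only the most recent samples."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.items = deque()

    def add(self, sample):
        self.items.append(sample)
        self._evict()

    def resize(self, new_max_size):
        # Assumption: "adaptive sizing" is modeled as a capacity chosen by
        # the caller; the paper's actual sizing rule is not reproduced here.
        self.max_size = max(1, new_max_size)
        self._evict()

    def _evict(self):
        # Drop the oldest samples first until the buffer fits its capacity.
        while len(self.items) > self.max_size:
            self.items.popleft()


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def regularized_loss(policy_loss, policy_probs, reference_probs, beta):
    # Assumption: the regularizer is added to the policy loss with a
    # hypothetical weight beta; the abstract does not give the exact form.
    return policy_loss + beta * js_divergence(policy_probs, reference_probs)


if __name__ == "__main__":
    buffer = FifoReplayBuffer(max_size=3)
    for step in range(5):
        buffer.add(step)
    print(list(buffer.items))  # the three most recent samples remain

    peaked = [0.9, 0.05, 0.05]   # policy concentrated on its Rank-1 token
    spread = [0.5, 0.3, 0.2]     # more diverse reference distribution
    print(regularized_loss(1.0, peaked, spread, beta=0.1))
```

Because the Jensen-Shannon divergence compares each distribution against their mixture, it stays finite even when one distribution assigns zero probability to a token the other favors, which is one plausible reason to prefer it over a plain KL penalty in this setting.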
