2601.19620v2 Jan 27, 2026 cs.LG

R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning

Jian Luan (Citations: 828, h-index: 12)

Wei Liu (Citations: 7, h-index: 1)

Zhizheng Jiang (Citations: 10, h-index: 1)

K. Zhao (Citations: 44, h-index: 2)

Weikai Xu (Citations: 222, h-index: 6)

Xin Lin (Citations: 17, h-index: 2)

Shuo Shang (Citations: 19, h-index: 2)

Peng Han (Citations: 121, h-index: 4)

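Large reasoning models (LRMs) aim to solve a wide range of complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without process-level annotations. However, these methods depend on advantage gaps induced by high-quality samples within the same batch, which makes training unstable and inefficient when intra-group advantages shrink on difficult tasks. To address these problems, we propose R^3, a reinforcement learning mechanism that operates along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling useful examples from past trajectories of the same query, (2) an in-context self-Reflection mechanism that lets the model improve its outputs by drawing on past failures, and (3) a structural entropy Ranking reward that ranks responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement the method on Deepseek-R1-Distill-Qwen-1.5B and train it on the DeepscaleR-40k dataset in the math domain. Experiments show that the proposed method achieves state-of-the-art performance on several math benchmarks, with substantial gains over the base model and fewer reasoning tokens. The code and model will be released.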

Original Abstract

Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named R^3 that operates along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an in-context self-Reflection mechanism that enables models to refine outputs by leveraging past failures, and (3) a structural entropy Ranking reward, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on the DeepscaleR-40k dataset in the math domain. Experiments demonstrate that our method achieves state-of-the-art performance on several math benchmarks, with significant improvements over the base models and fewer reasoning tokens. Code and model will be released.
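
The advantage collapse described in the abstract can be illustrated with a small sketch. It assumes a GRPO-style group normalization, (reward - group mean) / group standard deviation, which is a common choice for group-based policy optimization but is not stated on this page. The function name `group_advantages`, the reward values, and the way a replayed sample is substituted are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed GRPO-style formula; the paper's exact estimator may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard query: every sampled response fails, so all rewards are equal
# and every advantage is zero (no learning signal for this group).
failed_group = [0.0, 0.0, 0.0, 0.0]
collapsed = group_advantages(failed_group)

# Hypothetical replay: one failed sample is replaced by a stored
# successful trajectory for the same query, restoring a reward gap.
replayed_group = [0.0, 0.0, 0.0, 1.0]
restored = group_advantages(replayed_group)

print("collapsed:", collapsed)
print("restored: ", restored)
```

In this toy setting the all-failed group yields zero advantages for every sample, while the group containing one replayed success gives that sample a positive advantage and the failures a negative one. How R^3 actually selects and weights replayed trajectories is not specified in the abstract.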
