2602.03143v1 Feb 03, 2026 cs.LG

Self-Hinting Language Models Enhance Reinforcement Learning

Jiang Bian (Citations: 52, h-index: 4)

Baohao Liao (Citations: 408, h-index: 11)

Hanze Dong (Citations: 83, h-index: 6)

Xinxing Xu (Citations: 29, h-index: 3)

C. Monz (Citations: 2,326, h-index: 18)

Original Abstract

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $τ$ conditioned on $(x,h)$. Crucially, the task reward $R(x,τ)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
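The collapse described in the abstract can be illustrated with a small sketch. It assumes the common GRPO-style normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation; the abstract does not spell out the exact variant SAGE uses. The function names, the group size of 8, and the success probabilities are hypothetical values chosen only for illustration. The second function treats rollouts as independent draws with a binary reward, which is a simplification of hint-conditioned sampling.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group (assumed GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_mixed_group(p, group_size):
    """Probability that a group of independent binary-reward rollouts
    contains at least one success and at least one failure."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


# Every rollout fails: all advantages are zero, so the policy update vanishes.
collapsed = group_relative_advantages([0.0] * 8)

# A single success among eight rollouts: advantages are non-zero.
informative = group_relative_advantages([0.0] * 7 + [1.0])

# Hypothetical per-rollout success rates on a hard prompt (illustrative only).
p_no_hint = 0.01
p_with_hint = 0.20

if __name__ == "__main__":
    print("collapsed:", collapsed)
    print("informative:", [round(a, 3) for a in informative])
    print("P(mixed group), no hint:", round(prob_mixed_group(p_no_hint, 8), 4))
    print("P(mixed group), with hint:", round(prob_mixed_group(p_with_hint, 8), 4))
```

Under these assumed numbers, a group of 8 rollouts has mixed outcomes only about 8% of the time without hints and about 83% of the time with them, which is the sense in which a higher within-group outcome diversity keeps the relative advantages from collapsing. The reward function itself is untouched in both cases; only the rollout distribution changes.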

10 Citations

3 Influential

43.451858789481 Altmetric

233.3 Score
