2601.04731v1 Jan 08, 2026 cs.AI

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

Shuyang Jiang

Citations: 244

h-index: 8

Yuhao Wang

Citations: 199

h-index: 7

Y. Zhang

Citations: 16

h-index: 3

Yanfeng Wang

Citations: 363

h-index: 11

Yu Wang

Citations: 3

h-index: 1

Abstract

Current critic-free RL methods for large reasoning models are severely inefficient when training on positive homogeneous prompts (where all rollouts are correct): the advantage estimates are zero, so the rollouts are wasted. We introduce a radically simple yet powerful solution to Mine intrinsic mastery (Miner) that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method introduces two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance, outperforming the four other algorithms compared, with absolute gains of up to 4.58 in Pass@1 and 6.66 in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further shows the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models.
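The zero-advantage problem described in the abstract can be illustrated with a minimal sketch of group-relative advantage estimation, assuming the common GRPO-style formulation in which each rollout's reward is standardized within its prompt group. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions. This is not Miner's method, only the baseline behavior it targets.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style).

    `eps` is a hypothetical stabilizer that avoids division by zero;
    real implementations may differ in how they normalize.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group: correct rollouts get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A positive homogeneous group: every rollout is correct, so every
# advantage is zero and the group contributes no policy-gradient signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_correct)
```

In the homogeneous case every reward equals the group mean, so the numerator vanishes regardless of how the denominator is stabilized. This is the wasted signal that the abstract proposes to recover by using the policy's own uncertainty as an intrinsic reward.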
