2601.18779v1 Jan 26, 2026 cs.LG

POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

Amrith Rajagopal Setlur

Carnegie Mellon University

Citations: 1,102

h-index: 14

Yuxiao Qu

Citations: 349

h-index: 6

Virginia Smith

Citations: 294

h-index: 6

Ruslan Salakhutdinov

Citations: 473

h-index: 9

Aviral Kumar

Citations: 399

h-index: 7

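Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), but even state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, so the reward is zero and no learning signal for improvement arises. We find that natural remedies for this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve the issue; they often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive because of "ray interference": optimization concentrates on problems that are already solvable and actively suppresses progress on harder ones. To address this, we introduce an approach called Privileged On-Policy Exploration (POPE). POPE uses human or other oracle solutions as privileged information to guide exploration on hard problems. This differs from methods that use oracle solutions as training targets (e.g., off-policy RL methods or warm-starting from SFT). POPE adds prefixes of oracle solutions to hard problems so that RL can obtain non-zero rewards during guided rollouts. Importantly, the resulting behaviors transfer back to the original, unguided problems, thanks to a synergy between instruction-following and reasoning. Empirically, POPE expands the range of solvable problems and substantially improves performance on hard reasoning benchmarks.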

Original Abstract

Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
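The two mechanisms the abstract describes can be illustrated with a minimal Python sketch: why an all-zero-reward group gives no learning signal, and how a hard problem might be augmented with a prefix of an oracle solution. This is not the authors' implementation. The `Problem` class, the prompt template, the word-level prefix cut, and the group-mean baseline used for advantages are all assumptions made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Problem:
    question: str
    oracle_solution: str  # human or other oracle solution (privileged information)


def oracle_prefix(solution: str, fraction: float) -> str:
    """Return roughly the first `fraction` of the solution, cut at a word boundary.

    Cutting by word count is an assumption; the abstract only says "prefixes".
    """
    words = solution.split()
    keep = int(len(words) * fraction)
    return " ".join(words[:keep])


def build_prompt(problem: Problem, fraction: Optional[float]) -> str:
    """Build an unguided prompt (fraction=None) or a prefix-guided prompt."""
    if fraction is None:
        return problem.question
    prefix = oracle_prefix(problem.oracle_solution, fraction)
    if not prefix:
        return problem.question
    # Hypothetical guidance template; the wording is not taken from the paper.
    return (
        f"{problem.question}\n\n"
        f"Partial solution:\n{prefix}\n\n"
        "Continue from the partial solution and finish the problem."
    )


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages over a group of rollouts for one prompt.

    Assumes a group-mean baseline; other on-policy estimators differ in detail.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    hard = Problem(
        question="Prove that the sum of two even integers is even.",
        oracle_solution="Write a = 2m and b = 2n. Then a + b = 2(m + n), which is even.",
    )
    print(build_prompt(hard, None))   # unguided: the original problem
    print(build_prompt(hard, 0.5))    # guided: problem plus half of the oracle solution

    # Every unguided rollout fails: all advantages are zero, so no gradient signal.
    print(group_advantages([0.0, 0.0, 0.0, 0.0]))
    # One guided rollout succeeds: advantages become non-zero.
    print(group_advantages([0.0, 0.0, 1.0, 0.0]))
```

In this sketch the oracle text appears only in the prompt, never as a target for the loss, which mirrors the abstract's distinction between using oracle solutions as privileged information and using them as training targets.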

3 Citations

0 Influential

7 Altmetric

38.0 Score
