2603.25184v1 Mar 26, 2026 cs.LG

Learning at the Moving Edge: Online-Verified Prompt Selection for Efficient Reinforcement Learning of Large Reasoning Models

Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

Kun Wang (Citations: 66, h-index: 4)
Jiahao Wu (Citations: 50, h-index: 3)
Ning Lu, Hong Kong University of Science and Technology (Citations: 198, h-index: 8)
Yanting Yang (Citations: 258, h-index: 4)
Li Qing (Citations: 2, h-index: 1)
Ke Tang (Citations: 292, h-index: 7)
Shengcai Liu (Citations: 889, h-index: 15)

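Reinforcement learning (RL) has become an essential technique for post-training large language models (LLMs) on reasoning tasks. Scaling up rollouts can stabilize training and improve performance, but computational cost is a critical concern. In algorithms such as GRPO, generating multiple rollouts per prompt incurs substantial cost, because many prompts yield negligible gradients and therefore have low utility. To address this, we study how to select high-utility prompts before the rollout phase. Our empirical analysis shows that sample utility is non-uniform and changes over time: strong learning signals concentrate at the "learning edge", where intermediate difficulty intersects with high uncertainty, and this edge shifts as training progresses. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a two-stage framework for data-efficient RL. HIVE uses historical reward trajectories to perform a coarse selection, then uses prompt entropy as a real-time proxy to prune instances that no longer carry utility. Evaluating HIVE across multiple math reasoning benchmarks and models, we show that it delivers substantial rollout efficiency without degrading performance.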
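To make the "negligible gradients" point concrete, here is a minimal sketch of group-relative advantage computation in the style of GRPO, assuming the common formulation that standardizes rewards within one prompt's rollout group. The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a single prompt's rollout group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    Implementations differ on the exact normalization.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def binary_reward_variance(pass_rate):
    """Variance of a 0/1 reward whose success probability is pass_rate."""
    return pass_rate * (1.0 - pass_rate)


# A prompt the model always solves: every rollout earns the same reward,
# so every advantage is zero and the prompt contributes no policy gradient.
solved = group_relative_advantages([1, 1, 1, 1])

# A prompt solved half of the time: advantages are non-zero with mixed signs.
mixed = group_relative_advantages([1, 0, 1, 0])
```

Under this formulation, prompts that are always solved or never solved spend rollouts without producing a learning signal, while the reward variance `p * (1 - p)` peaks at a pass rate of 0.5. This is consistent with the abstract's emphasis on intermediate difficulty.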

Original Abstract

Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency without compromising performance.
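The two stages described in the abstract could be sketched roughly as follows. This is a hypothetical illustration only, not the authors' implementation: the abstract does not specify the selection rules, so the reward band (`low`, `high`), the entropy threshold (`min_entropy`), the policy of keeping unseen prompts, and every function name here are assumptions. How prompt entropy is actually computed is also not stated, so it is passed in as an opaque callable.

```python
from typing import Callable, Dict, List


def coarse_select(history: Dict[str, List[float]],
                  low: float = 0.2, high: float = 0.8) -> List[str]:
    """Stage 1 (assumed rule): keep prompts with an intermediate historical mean reward."""
    selected = []
    for prompt_id, rewards in history.items():
        if not rewards:
            # Assumption: prompts with no history are kept, since nothing rules them out.
            selected.append(prompt_id)
            continue
        mean_reward = sum(rewards) / len(rewards)
        if low <= mean_reward <= high:
            selected.append(prompt_id)
    return selected


def online_verify(candidates: List[str],
                  entropy_fn: Callable[[str], float],
                  min_entropy: float = 0.5) -> List[str]:
    """Stage 2 (assumed rule): prune candidates whose current entropy is low."""
    return [p for p in candidates if entropy_fn(p) >= min_entropy]


def select_prompts(history, entropy_fn, low=0.2, high=0.8, min_entropy=0.5):
    """Run the coarse history filter, then the online entropy check."""
    return online_verify(coarse_select(history, low, high), entropy_fn, min_entropy)


# Toy data: per-prompt reward histories and made-up current entropy values.
history = {
    "p_easy": [1, 1, 1, 1],
    "p_hard": [0, 0, 0, 0],
    "p_edge": [1, 0, 1, 0],
    "p_stale": [1, 0, 0, 1],
    "p_new": [],
}
current_entropy = {"p_easy": 0.1, "p_hard": 0.3, "p_edge": 1.2,
                   "p_stale": 0.2, "p_new": 0.9}

chosen = select_prompts(history, current_entropy.get)
```

In this toy run, `p_stale` passes the history filter but is pruned by the online check. That mirrors the abstract's motivation: historical rewards can lag behind the current policy, so a fresh signal is used to catch prompts whose utility has gone stale before any rollouts are spent on them.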

1 Citations

0 Influential

7.5 Altmetric

38.5 Score
