2602.20532v1 Feb 24, 2026 cs.LG

Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

Henry Peng Zou

University of Illinois Chicago

Wei Cheng

Zhengyao Gu

Jonathan Light

Raul Astudillo

Ziyu Ye

Langzhou He

Santiago Paternain

Philip S. Yu

Yisong Yue

Abstract

Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline, along with up to an 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
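To make the bandit framing in the abstract more concrete, the following is a minimal tabular sketch of problem selection under partial feedback. It uses an entropic mirror-descent (exponentiated-gradient, EXP3-style) update with importance-weighted gain estimates. This is not the paper's method: ACTOR-CURATOR trains a neural curator with a loss derived in the paper, which the abstract does not spell out. The names `omd_update`, `run_curator`, and `gain_fn`, the uniform-exploration mixing, and all hyperparameter values are assumptions made purely for illustration.

```python
import math
import random


def omd_update(probs, chosen, gain, sampling_prob, lr):
    """One entropic mirror-descent step on the probability simplex.

    Only the chosen problem's gain is observed (partial feedback), so the
    gain is divided by the probability with which that problem was sampled
    to obtain an unbiased estimate of the full gain vector.
    """
    estimate = gain / sampling_prob
    weights = list(probs)
    weights[chosen] *= math.exp(lr * estimate)
    total = sum(weights)
    return [w / total for w in weights]


def run_curator(gain_fn, num_problems, steps, lr=0.05, explore=0.1, seed=0):
    """Hypothetical curator loop; gain_fn(problem_index, step) returns a value in [0, 1].

    gain_fn stands in for a measured policy-improvement signal (an assumption,
    not the paper's estimator). It may depend on `step`, which is how
    non-stationarity would enter.
    """
    rng = random.Random(seed)
    probs = [1.0 / num_problems] * num_problems
    for step in range(steps):
        # Mix with the uniform distribution so every problem keeps a nonzero
        # sampling probability and importance weights stay bounded.
        sampling = [(1 - explore) * p + explore / num_problems for p in probs]
        chosen = rng.choices(range(num_problems), weights=sampling)[0]
        gain = gain_fn(chosen, step)
        probs = omd_update(probs, chosen, gain, sampling[chosen], lr)
    return probs
```

For example, calling `run_curator(lambda i, t: 0.9 if i == 2 else 0.1, num_problems=3, steps=500)` should concentrate most of the probability mass on problem 2, the one with the highest simulated gain. The uniform mixing is one simple way to keep the selection distribution responsive when gains drift over time; the paper's regret analysis may rely on a different mechanism.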
