2603.23414v1 Mar 24, 2026 cs.LG

SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Xufang Luo

Beihang University

Dongsheng Li

Yifei Shen

Lili Qiu

Yang You

Yiqi Zhang

Huiqiang Jiang

Microsoft Research Asia

Zhihe Yang

Chengruidong Zhang

Yuqing Yang

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens). The causes are slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy that addresses this bottleneck by improving rollout efficiency while maintaining training stability. SortedRL reorders rollout samples by output length and groups short samples together so that they can be used for early updates. This simultaneously enables large rollout batches, flexible update batches, and the construction of a near on-policy micro-curriculum. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism, and it is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and a rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% better performance than the baseline given the same amount of data.
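
The length-based reordering described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it sorts a set of already finished rollouts offline, whereas SortedRL schedules online, and the names `Rollout`, `length_sorted_groups`, and `idle_fraction` are hypothetical. The idle measure is a toy stand-in chosen for illustration and is not necessarily the paper's bubble ratio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt_id: int
    output_len: int  # number of generated tokens


def length_sorted_groups(rollouts: List[Rollout], group_size: int) -> List[List[Rollout]]:
    """Sort rollouts by output length and chunk them into update groups.

    Shortest groups come first, so they could be used for early updates.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    ordered = sorted(rollouts, key=lambda r: r.output_len)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def idle_fraction(group: List[Rollout]) -> float:
    """Toy measure (an assumption): share of token slots left idle while
    every sample in the group waits for the longest one to finish."""
    longest = max(r.output_len for r in group)
    used = sum(r.output_len for r in group)
    return 1.0 - used / (longest * len(group))


# Illustrative output lengths, made up for this example.
lengths = [16000, 300, 9000, 450, 12000, 500]
rollouts = [Rollout(i, n) for i, n in enumerate(lengths)]
groups = length_sorted_groups(rollouts, group_size=3)

for g in groups:
    print([r.output_len for r in g], round(idle_fraction(g), 3))
```

In this toy setting, grouping samples of similar length leaves fewer idle slots than batching them in arrival order, which is the intuition behind length-aware grouping. How much the real system gains depends on details the abstract does not give.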
