2602.06107v1 Feb 05, 2026 cs.AI

Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Zhuo Chen (Citations: 99, h-index: 5)
Hongyi Liu (Citations: 100, h-index: 3)
Yang Zhou (Citations: 109, h-index: 3)
Haizhong Zheng (Citations: 60, h-index: 3)
Beidi Chen (Citations: 490, h-index: 7)

Original Abstract

Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation is costly. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to generate rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
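
To make the idea of a "controllable acceptance budget" concrete, here is a minimal sketch of budgeted rejection sampling on a toy three-token distribution. It assumes the commonly used acceptance rule a(x) = min(1, p(x) / (c q(x))), where q is the rollout distribution, p is the target policy, and the scale c is chosen so that the expected acceptance rate equals the budget. The abstract does not spell out Jackpot's exact rule, so the rule itself, the function names, and the toy numbers are all illustrative assumptions; top-$k$ probability estimation and batch-level bias correction are not modeled.

```python
def acceptance_probs(p, q, c):
    """Assumed rule: a(x) = min(1, p(x) / (c * q(x))) for each token x."""
    return [min(1.0, pi / (c * qi)) if qi > 0 else 0.0 for pi, qi in zip(p, q)]


def acceptance_rate(p, q, c):
    """Expected fraction of rollout samples kept: sum_x q(x) * a(x)."""
    return sum(qi * ai for qi, ai in zip(q, acceptance_probs(p, q, c)))


def solve_scale(p, q, budget, iters=200):
    """Bisect for the scale c whose acceptance rate matches the budget.

    The acceptance rate is non-increasing in c, so bisection applies.
    """
    max_ratio = max(pi / qi for pi, qi in zip(p, q) if qi > 0)
    lo, hi = 1e-12, max(max_ratio, 1.0 / budget)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptance_rate(p, q, mid) > budget:
            lo = mid  # accepting too much: increase c
        else:
            hi = mid  # accepting too little: decrease c
    return 0.5 * (lo + hi)


def post_rejection_dist(p, q, c):
    """Distribution of the samples that survive rejection."""
    kept = [qi * ai for qi, ai in zip(q, acceptance_probs(p, q, c))]
    total = sum(kept)
    return [k / total for k in kept]


def total_variation(a, b):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy distributions over three tokens.
p = [0.7, 0.2, 0.1]  # target policy
q = [0.2, 0.3, 0.5]  # rollout model
c = solve_scale(p, q, budget=0.5)
adjusted = post_rejection_dist(p, q, c)
# Distance to the target before and after rejection (roughly 0.5 and 0.3).
print(total_variation(p, q), total_variation(p, adjusted))
```

In this toy case, keeping half of the rollout samples moves the surviving distribution closer to the target without matching it exactly, which is the trade-off a budget is meant to expose: a smaller budget discards more samples in exchange for a closer match.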

3 Citations, 0 Influential, 3.5 Altmetric, 20.5 Score
