2602.00815v1 Jan 31, 2026 cs.AI

Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement

Yunjian Zhang (citations: 64, h-index: 5)
Sudong Wang (citations: 68, h-index: 2)
Jianing Li (citations: 48, h-index: 3)
Peiran Xu (citations: 13, h-index: 1)
Yao Zhu (citations: 47, h-index: 4)
Yang Li (citations: 3,005, h-index: 8)
Cong Zhou (citations: 145, h-index: 8)
Xiaoyu Ma (citations: 8, h-index: 1)

Abstract

Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.
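The abstract describes DoPR only at a high level: one informative sample is chosen per batch, guided by reward volatility and exploration-driven acquisition. As a rough illustration of what such a selection rule could look like, the sketch below scores each prompt by the variance of its recent rewards plus a UCB-style bonus for rarely selected prompts, then picks the top scorer. The scoring formula, the function name `select_one_shot`, and the parameters `reward_history`, `pick_counts`, `step`, and `explore_coef` are all assumptions made for illustration; the abstract does not specify the paper's actual criterion.

```python
import math
from statistics import pvariance


def select_one_shot(reward_history, pick_counts, step, explore_coef=1.0):
    """Pick the index of one prompt in the batch to train on.

    Hypothetical scoring rule (not taken from the paper):
    score = variance of past rewards + UCB-style exploration bonus.

    reward_history: list of lists, past verifiable rewards per prompt.
    pick_counts:    list of ints, how often each prompt was selected so far.
    step:           current training step (0-based).
    explore_coef:   weight of the exploration bonus (assumed hyperparameter).
    """
    best_idx, best_score = 0, float("-inf")
    for i, rewards in enumerate(reward_history):
        # Reward volatility: prompts the policy sometimes solves and sometimes
        # fails carry a non-trivial learning signal.
        volatility = pvariance(rewards) if len(rewards) > 1 else 0.0
        # Exploration bonus: grows for prompts that have rarely been selected.
        bonus = explore_coef * math.sqrt(math.log(step + 1) / (pick_counts[i] + 1))
        score = volatility + bonus
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


if __name__ == "__main__":
    history = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
    # With equal pick counts, the prompt with mixed outcomes is selected.
    print(select_one_shot(history, pick_counts=[3, 3, 3], step=10))
```

Under this assumed rule, only the selected prompt would be rolled out for the policy update at each step, which is the kind of mechanism that could yield the rollout savings the abstract reports. Prompts that are always solved or always failed have zero reward variance and are chosen only when the exploration bonus dominates.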
