2601.19404v2 Jan 27, 2026 cs.AI

RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization

Hongzhu Yi

Citations: 9

h-index: 2

Xinming Wang

Citations: 23

h-index: 3

Zhenghao Zhang

Citations: 12

h-index: 2

Tianyu Zong

Citations: 153

h-index: 5

Yuanxiang Wang

Citations: 9

h-index: 2

Jun Xie

Citations: 45

h-index: 4

Tao Yu

Citations: 8

h-index: 2

Hao Jin

Citations: 19

h-index: 3

Kaixin Xu

Citations: 30

h-index: 3

Jiahuan Chen

Citations: 6

h-index: 2

Yujia Yang

Citations: 6

h-index: 2

Zhenyu Guan

Citations: 4

h-index: 1

Jungang Xu

Citations: 11

h-index: 2

Feng Chen

Citations: 84

h-index: 5

Bingkang Shi

Citations: 43

h-index: 3

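In large language models, reinforcement fine-tuning algorithms must generate a complete reasoning trajectory starting from the input query, which makes the rollout phase of training computationally expensive. To address this, we analyze how different segments of the reasoning path affect the correctness of the final result and, based on that analysis, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Whereas existing reinforcement fine-tuning algorithms generate the full reasoning path, RPO trains the model by generating suffixes of the reasoning path using an experience cache. During the rollout phase, RPO reduces token generation by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO shortens training time by 90% for the 1.5B model and by 72% for the 7B model. It can also be integrated with existing algorithms such as GRPO and DAPO, accelerating their training while maintaining performance comparable to the originals. Our code is publicly available at https://github.com/yhz5613813/RPO.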
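The percentage reductions quoted in the abstract can be restated as speedup factors, which some readers find easier to compare. The short sketch below only applies the identity speedup = 1 / (1 - reduction) to the three figures stated above; the function name `speedup_from_reduction` is a hypothetical helper, and the results are arithmetic restatements, not additional measurements from the paper.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction (e.g. 0.90 for 90%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Figures taken from the abstract; labels are for display only.
reported = [
    ("rollout token generation", 0.95),
    ("1.5B model training time", 0.90),
    ("7B model training time", 0.72),
]

for label, reduction in reported:
    factor = speedup_from_reduction(reduction)
    print(f"{label}: {reduction:.0%} reduction is about {factor:.2f}x")
```

Note that the token-generation figure and the training-time figures measure different things, so the roughly 20x drop in generated tokens should not be expected to translate one-to-one into wall-clock speedup.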

Original Abstract

Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using experience cache. During the rollout phase of training, RPO reduces token generation in this phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
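To make the idea of suffix-only rollouts concrete, here is a minimal sketch of how an experience cache might be used to skip regenerating most of a reasoning path. It is not the authors' implementation: the abstract does not specify how prefixes are chosen or how the cache is updated, so the function names (`split_prefix`, `partial_rollout`, `toy_generate`), the `keep_fraction` parameter, the dictionary-based cache, and the overwrite-on-every-rollout update rule are all assumptions made for illustration. See the linked repository for the actual method.

```python
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
Generator = Callable[[str, Tokens], Tokens]


def split_prefix(trajectory: Tokens, keep_fraction: float) -> Tokens:
    """Return the leading part of a cached trajectory that will be reused."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    cut = round(len(trajectory) * keep_fraction)
    return trajectory[:cut]


def partial_rollout(
    query: str,
    cache: Dict[str, Tokens],
    generate: Generator,
    keep_fraction: float = 0.95,  # assumed value, chosen to mirror the ~95% figure
) -> Tuple[Tokens, int]:
    """Reuse a cached prefix when one exists and generate only the suffix.

    Returns the full trajectory and the number of newly generated tokens.
    """
    cached = cache.get(query, [])          # empty on the first visit: full rollout
    prefix = split_prefix(cached, keep_fraction)
    suffix = generate(query, prefix)       # the only part that costs generation
    trajectory = prefix + suffix
    cache[query] = trajectory              # hypothetical update rule
    return trajectory, len(suffix)


def toy_generate(query: str, prefix: Tokens, total_len: int = 100) -> Tokens:
    """Stand-in for a policy model: fills the trajectory up to total_len tokens."""
    return [f"tok{i}" for i in range(len(prefix), total_len)]


cache: Dict[str, Tokens] = {}
_, first_cost = partial_rollout("q1", cache, toy_generate)
_, second_cost = partial_rollout("q1", cache, toy_generate)
print(first_cost, second_cost)  # 100 5
```

In this toy setup the first visit to a query pays for a full 100-token rollout, while the second visit reuses 95 cached tokens and generates only 5. A real policy model would sample the suffix stochastically and score it with a reward function, which this sketch leaves out.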

3 Citations

0 Influential

32.229550745277 Altmetric

164.1 Score

