2601.19404v1 Jan 27, 2026 cs.AI

RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization

Hongzhu Yi (Citations: 9, h-index: 2)
Xinming Wang (Citations: 23, h-index: 3)
Zhenghao Zhang (Citations: 12, h-index: 2)
Tianyu Zong (Citations: 153, h-index: 5)
Yuanxiang Wang (Citations: 9, h-index: 2)
Jun Xie (Citations: 45, h-index: 4)
Tao Yu (Citations: 8, h-index: 2)
Hao Jin (Citations: 19, h-index: 3)
Kaixin Xu (Citations: 30, h-index: 3)
Jiahuan Chen (Citations: 6, h-index: 2)
Yujia Yang (Citations: 6, h-index: 2)
Zhenyu Guan (Citations: 4, h-index: 1)
Jungang Xu (Citations: 11, h-index: 2)
Feng Chen (Citations: 84, h-index: 5)
Bingkang Shi (Citations: 43, h-index: 3)

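Reinforcement fine-tuning algorithms for large language models must generate a complete reasoning path starting from the input query, which incurs substantial computational overhead during the rollout phase of training. To address this, the paper analyzes how each segment of the reasoning path affects the correctness of the final result and, based on that analysis, proposes Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike conventional algorithms that generate the full reasoning path, RPO trains the model by generating only the later part (suffix) of the reasoning path, using an experience cache. In the rollout phase, RPO cuts token generation by about 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning, RPO shortens training time by 90% for the 1.5B model and by 72% for the 7B model. It can also be integrated with representative algorithms such as GRPO and DAPO, accelerating training while maintaining performance comparable to the original algorithms. The code is available at https://github.com/yhz5613813/RPO.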
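The reported percentages are easier to compare once converted to speedup factors. The arithmetic below is a reader's aid, not a result from the paper. It assumes that "reducing training time by a fraction r" means T_RPO = (1 - r) T_full, and the symbols S, r, and r_tok are introduced here for illustration only.

```latex
\[
S \;=\; \frac{T_{\text{full}}}{T_{\text{RPO}}} \;=\; \frac{1}{1 - r}
\]
\[
r = 0.90 \;\Rightarrow\; S = \frac{1}{0.10} = 10
\qquad\qquad
r = 0.72 \;\Rightarrow\; S = \frac{1}{0.28} \approx 3.57
\]
\[
r_{\text{tok}} = 0.95 \;\Rightarrow\; \frac{1}{1 - r_{\text{tok}}} = 20
\quad \text{(about 20 times fewer tokens generated during rollout)}
\]
```

Under this reading, both end-to-end speedups (about 10x and 3.57x) are smaller than the roughly 20x drop in generated tokens. That would be expected if rollout is only one part of each training step, although the abstract itself does not break the time down by phase.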

Original Abstract

Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using experience cache. During the rollout phase of training, RPO reduces token generation in this phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
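The idea of generating only a suffix from an experience cache can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). The class `ExperienceCache`, the functions `rollout_group` and `toy_generate`, the 5% suffix fraction, and the rule of reusing the first cached path as the prefix are all assumptions made here for illustration. The abstract does not say how the prefix is chosen or how the cache is updated.

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = str
Generator = Callable[[str, List[Token]], List[Token]]


class ExperienceCache:
    """Hypothetical cache: one previously generated reasoning path per query."""

    def __init__(self) -> None:
        self._paths: Dict[str, List[Token]] = {}

    def get(self, query: str) -> Optional[List[Token]]:
        return self._paths.get(query)

    def put(self, query: str, path: List[Token]) -> None:
        self._paths[query] = list(path)


def rollout_group(
    query: str,
    cache: ExperienceCache,
    generate: Generator,
    group_size: int = 4,
    suffix_fraction: float = 0.05,  # assumed value, chosen to mirror the ~95% figure
) -> Tuple[List[List[Token]], int]:
    """Sample a group of reasoning paths and count the tokens actually generated."""
    cached = cache.get(query)
    if cached:
        # Reuse most of the cached path; only the tail is regenerated.
        suffix_len = max(1, round(len(cached) * suffix_fraction))
        prefix = cached[: len(cached) - suffix_len]
    else:
        # Cold start: nothing cached yet, so generate the full path.
        prefix = []

    paths: List[List[Token]] = []
    generated = 0
    for _ in range(group_size):
        suffix = generate(query, prefix)
        generated += len(suffix)
        paths.append(prefix + suffix)
    return paths, generated


def toy_generate(query: str, prefix: List[Token]) -> List[Token]:
    """Stand-in for the policy model: continues any prefix up to 100 tokens."""
    return [f"tok{i}" for i in range(len(prefix), 100)]


cache = ExperienceCache()
query = "What is 17 * 23?"

# First pass: the cache is empty, so every path is generated in full.
full_paths, full_tokens = rollout_group(query, cache, toy_generate)
cache.put(query, full_paths[0])

# Second pass: only the last 5 tokens of each path are generated.
partial_paths, partial_tokens = rollout_group(query, cache, toy_generate)

reduction = 1 - partial_tokens / full_tokens
print(full_tokens, partial_tokens, f"{reduction:.0%}")  # 400 20 95%
```

The sketch covers only the rollout bookkeeping. Reward computation and the policy update (for example, a GRPO- or DAPO-style objective over the sampled group) are left out, and a real generator would be a language model that continues from the cached prefix.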

3 Citations

0 Influential

32.229550745277 Altmetric

164.1 Score
