2604.08905v1 Apr 10, 2026 cs.AI

StaRPO: Stability-Augmented Reinforcement Policy Optimization

Kunpeng Liu

Citations: 935

h-index: 18

Yanjie Fu

Citations: 377

h-index: 12

Jinghan Zhang

Citations: 46

h-index: 2

Tharindu Cyril Weerasooriya

Rochester Institute of Technology

Citations: 133

h-index: 7

Fengran Mo

Citations: 59

h-index: 4

R. Dai

Citations: 3

h-index: 1

Dakuo Wang

Citations: 87

h-index: 6

Xiaoyang Han

Citations: 41

h-index: 3

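Reinforcement learning (RL) is effective at improving the accuracy of large language models on complex reasoning tasks. Existing RL policy optimization frameworks use final-answer correctness as the feedback signal and often fail to capture the internal logical structure of the reasoning process. As a result, models tend to produce responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To address this, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. StaRPO decomposes stability into two computable, lightweight metrics: the Autocorrelation Function (ACF), which evaluates local step-to-step coherence, and Path Efficiency (PE), which evaluates the overall goal-directedness of the reasoning trajectory. These stability rewards are combined with the task reward to provide complementary, process-aware feedback. We validate the ACF and PE rewards by showing how they correlate with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms existing methods and improves both final-answer accuracy and logical stability.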

Original Abstract

Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as the feedback signal and rarely capture the internal logical structure of the reasoning process. Consequently, models may generate responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms the compared baselines and can enhance both final-answer accuracy and logical stability.
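The abstract does not give formulas for ACF, PE, or the way they are combined with the task reward, so the following is only a minimal sketch of how such rewards might be computed, not the authors' implementation. It assumes each reasoning step has already been mapped to an embedding vector, uses the standard sample autocorrelation (summed over embedding dimensions) as a stand-in for ACF, uses net displacement divided by total path length as a stand-in for PE, and combines them with a simple weighted sum. The function names and the weights `w_acf` and `w_pe` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def autocorrelation(steps: Sequence[Vector], lag: int = 1) -> float:
    """Sample autocorrelation of a sequence of step embeddings at a given lag.

    Assumption: this generic statistic stands in for the paper's ACF metric.
    The result lies in [-1, 1]; 0.0 is returned for degenerate inputs.
    """
    n = len(steps)
    if lag < 1 or n <= lag:
        return 0.0
    dim = len(steps[0])
    mean = [sum(s[d] for s in steps) / n for d in range(dim)]
    centered: List[List[float]] = [
        [s[d] - mean[d] for d in range(dim)] for s in steps
    ]
    denom = sum(x * x for c in centered for x in c)
    if denom == 0.0:
        return 0.0
    num = sum(
        sum(a * b for a, b in zip(centered[t], centered[t + lag]))
        for t in range(n - lag)
    )
    return num / denom


def path_efficiency(steps: Sequence[Vector]) -> float:
    """Net displacement divided by total path length, in [0, 1].

    Assumption: this geometric ratio stands in for the paper's PE metric.
    """
    if len(steps) < 2:
        return 0.0
    path_length = sum(math.dist(a, b) for a, b in zip(steps, steps[1:]))
    if path_length == 0.0:
        return 0.0
    return math.dist(steps[0], steps[-1]) / path_length


def combined_reward(
    task_reward: float,
    steps: Sequence[Vector],
    w_acf: float = 0.1,  # hypothetical weight
    w_pe: float = 0.1,   # hypothetical weight
) -> float:
    """Task reward plus weighted stability terms (assumed additive form)."""
    return (
        task_reward
        + w_acf * autocorrelation(steps)
        + w_pe * path_efficiency(steps)
    )


# Toy 2-D "embeddings": one trajectory heads straight to its endpoint,
# the other zigzags on the way there.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]]

print(path_efficiency(straight), path_efficiency(zigzag))
print(combined_reward(1.0, straight), combined_reward(1.0, zigzag))
```

Under these assumptions, a trajectory that moves directly toward its endpoint gets a path efficiency of 1, while detours lower it. An additive combination keeps final-answer correctness as the dominant signal when the stability weights are small.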

3 Citations

0 Influential

9 Altmetric

48.0 Score


