2603.26535v1 Mar 27, 2026 cs.AI

Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Shuyue Hu

Citations: 136

h-index: 7

Lei Bai

Citations: 284

h-index: 10

Zhenfei Yin

Citations: 280

h-index: 8

Zelin Tan

Citations: 122

h-index: 2

Zhouliang Yu

Citations: 234

h-index: 7

Bo-Cheng Lin

Citations: 14

h-index: 2

Zijie Geng

Citations: 257

h-index: 9

Hejia Geng

Citations: 16

h-index: 2

Mulei Zhang

Citations: 11

h-index: 1

Chen Zhang

Citations: 23

h-index: 2

Yudong Zhang

Citations: 4

h-index: 1

Yang Chen

Citations: 110

h-index: 1

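This work proposes Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, in order to overcome two limitations of existing reward designs. Outcome reward models (ORM) evaluate only the correctness of the final answer: they treat all correct answers identically, so differences in reasoning quality are not reflected, and the advantage signal gradually weakens as the group's accuracy rises. Process reward models (PRM) provide richer information, but using PRM scores directly leads to reward hacking, in which the model raises its score by generating excessively long answers at the expense of accuracy. PAPO addresses these problems by combining an outcome component A_out, derived from the ORM and normalized over all answers, with a process component A_proc, derived from a rubric-based PRM and normalized only over correct answers. In this decoupled design, A_out anchors training on correctness, while A_proc distinguishes reasoning quality without distorting the outcome signal. In experiments across multiple model sizes and six benchmarks, PAPO consistently outperformed ORM, reaching 51.3% on OlympiadBench versus 46.3% for ORM. PAPO also continued to improve while ORM's performance plateaued or declined.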

Original Abstract

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component A_out, derived from ORM and normalized over all responses, and a process component A_proc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that A_out anchors training on correctness while A_proc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
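
The decoupled normalization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two components are combined, so the weighted sum, the `process_weight` parameter, the use of a population standard deviation, the `eps` stabilizer, and all function names below are assumptions made for illustration only.

```python
from statistics import fmean, pstdev


def _normalize(values, eps=1e-6):
    """Z-score a list against its own mean and population std (assumed form)."""
    if not values:
        return []
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def papo_style_advantages(correct, process_scores, process_weight=1.0, eps=1e-6):
    """Per-response advantages for one group of sampled responses.

    correct        -- list of bools, final-answer correctness (the ORM signal)
    process_scores -- list of floats, rubric-based process scores (the PRM signal)
    process_weight -- hypothetical mixing weight; not taken from the paper
    """
    if len(correct) != len(process_scores):
        raise ValueError("correct and process_scores must have equal length")

    # Outcome component: binary reward normalized over ALL responses.
    a_out = _normalize([1.0 if c else 0.0 for c in correct], eps)

    # Process component: normalized ONLY among correct responses.
    # Incorrect responses get zero, so a high process score cannot
    # compensate for a wrong answer.
    correct_idx = [i for i, c in enumerate(correct) if c]
    a_proc = [0.0] * len(correct)
    for i, z in zip(correct_idx, _normalize([process_scores[i] for i in correct_idx], eps)):
        a_proc[i] = z

    # Assumed combination rule: a simple weighted sum.
    return [o + process_weight * p for o, p in zip(a_out, a_proc)]


if __name__ == "__main__":
    # Uniformly correct group: the outcome component vanishes,
    # but the process component still separates the responses.
    print(papo_style_advantages([True, True, True], [0.2, 0.5, 0.8]))
    # Mixed group: an incorrect response with a high process score gains nothing.
    print(papo_style_advantages([True, False, True, False], [0.9, 1.0, 0.3, 0.0]))
```

In the first call every response is correct, so the outcome term is zero for all of them and the ranking comes entirely from the process term. In the second call the two incorrect responses receive the same advantage even though one has the highest process score in the group, which is the behavior the decoupling is meant to guarantee.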

0 Citations

0 Influential

5 Altmetric

25.0 Score
