2602.10885v1 Feb 11, 2026 cs.AI

Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics

Tat-Seng Chua

Citations: 2,957

h-index: 30

An Zhang

Citations: 131

h-index: 6

Xiang Wang

Citations: 715

h-index: 14

Leheng Sheng

Citations: 483

h-index: 10

Wenchang Ma

Citations: 261

h-index: 8

Ruixin Hong

Citations: 195

h-index: 8

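Although chain-of-thought (CoT) plays an important role in large language model (LLM) reasoning, rewarding it directly is difficult: training a reward model requires extensive human labeling, and static reward models (RMs) struggle to cope with shifting CoT distributions and reward hacking. These challenges motivate the search for an autonomous CoT rewarding approach that can evolve gradually without human annotation. Inspired by recent self-evolving training methods, this paper proposes RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which strengthens outcome-centric RLVR by rewarding CoTs with self-proposed, self-evolving rubrics. Experiments show that these rubrics provide reliable CoT supervision signals even without outcome rewards, allowing RLCER to outperform outcome-centric RLVR. In addition, using the self-proposed rubrics as in-prompt hints further improves inference-time performance.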

Original Abstract

Despite chain-of-thought (CoT) playing crucial roles in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling efforts, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation efforts and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances the outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.
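
The abstract does not give the reward formula, so the following is only a rough illustration of the general idea of adding a rubric-based CoT reward to an outcome reward. Everything in it is an assumption made for illustration: the `Rubric` class, the keyword-based checks (standing in for rubrics the model would propose and a judge would score), the exact-match outcome reward, and the `cot_weight` parameter. It should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """Hypothetical rubric: a description plus a pass/fail check on the CoT text."""
    description: str
    check: Callable[[str], bool]


def rubric_score(cot: str, rubrics: List[Rubric]) -> float:
    """Fraction of rubrics the chain-of-thought satisfies (0.0 if there are none)."""
    if not rubrics:
        return 0.0
    return sum(r.check(cot) for r in rubrics) / len(rubrics)


def combined_reward(cot: str, answer: str, reference: str,
                    rubrics: List[Rubric], cot_weight: float = 0.5) -> float:
    """Outcome reward (exact match, assumed) plus a weighted rubric score."""
    outcome = 1.0 if answer.strip() == reference.strip() else 0.0
    return outcome + cot_weight * rubric_score(cot, rubrics)


# Toy rubrics: simple keyword checks used purely as placeholders.
rubrics = [
    Rubric("states intermediate steps", lambda c: "step" in c.lower()),
    Rubric("verifies the result", lambda c: "check" in c.lower()),
]
cot = "Step 1: 2 + 2 = 4. Check: 4 - 2 = 2."
reward = combined_reward(cot, "4", "4", rubrics)
print(reward)
```

In this toy setup a correct answer with a CoT that satisfies both placeholder rubrics receives the outcome reward plus the full weighted rubric bonus; how RLCER actually proposes, scores, and evolves its rubrics is described in the paper itself.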

2 Citations

0 Influential

15 Altmetric

77.0 Score

