2604.10701v1 Apr 12, 2026 cs.LG

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Zikang Shan

Citations: 214

h-index: 2

Han Zhong

Citations: 133

h-index: 2

Liwei Wang

Citations: 143

h-index: 3

Li Zhao

Citations: 192

h-index: 5

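Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address it through fine-grained advantage estimation based on a learned value function. In modern large language model (LLM) RL, however, learned value models are often left out because conventional discriminative critics are difficult to train reliably. This work revisits value modeling and argues that the difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and the authors' scaling experiments show that such critics do not improve reliably as they grow. Based on this observation, the authors propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. They also introduce In-Context Conditioning, which helps the critic stay calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains carry over to stronger downstream RL performance than both value-based and value-free baselines. Overall, the results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.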

Original Abstract

Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.
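
For readers unfamiliar with the "fine-grained advantage estimation based on a learned value function" mentioned above, the following is a minimal sketch of Generalized Advantage Estimation (GAE), a standard estimator in classical actor-critic methods. It is not GenAC itself, and the abstract does not say which estimator GenAC uses. The function name `gae_advantages`, the default `gamma` and `lam` values, and the toy numbers are illustrative assumptions. The `values` list stands in for the output of any critic, whether discriminative or generative.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Compute per-step advantages with Generalized Advantage Estimation.

    rewards: reward received after each of the T steps.
    values:  critic estimates for the T visited states plus one bootstrap
             value for the state after the last step (0.0 if terminal),
             so len(values) == len(rewards) + 1.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error under the critic's estimates.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical toy episode: a single terminal reward of 1.0 and a critic
# that predicts 0.5 at every non-terminal step.
toy_rewards = [0.0, 0.0, 1.0]
toy_values = [0.5, 0.5, 0.5, 0.0]
toy_advantages = gae_advantages(toy_rewards, toy_values, gamma=1.0, lam=1.0)
```

With `gamma=1.0` and `lam=1.0` the estimator reduces to the observed return minus the critic's prediction at each step, so the quality of the per-step credit signal depends directly on how accurate the critic's values are.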

1 Citations

0 Influential

2.5 Altmetric

13.5 Score
