2602.20670v1 Feb 24, 2026 cs.CL

CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu

Citations: 1

h-index: 1

Yang Luo

Citations: 240

h-index: 7

Kanchan Sarkar

Citations: 23

h-index: 1

Kun Xu

Citations: 55

h-index: 4

Yang You

Citations: 93

h-index: 4

Hailun Xu

Citations: 23

h-index: 1

Yong Liu

Citations: 50

h-index: 4

Abstract

Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
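The gating step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the verdict tokens `"A"` and `"B"`, the `threshold` value, and the `reflect` callback are hypothetical names introduced here, and the abstract does not say how the margin threshold is chosen. The sketch assumes the judge model has already returned log-probabilities for the two verdict tokens from a single forward pass.

```python
import math
from typing import Callable, Dict, Tuple


def verdict_margin(logprobs: Dict[str, float]) -> float:
    """Absolute log-probability gap between the two verdict tokens."""
    return abs(logprobs["A"] - logprobs["B"])


def gated_judge(
    logprobs: Dict[str, float],
    reflect: Callable[[str], str],
    threshold: float = 1.0,  # hypothetical value, not taken from the paper
) -> Tuple[str, bool]:
    """Return (final verdict, whether reflection was invoked).

    The single-token verdict is accepted when the margin is large;
    otherwise the (assumed) reflection routine is asked to confirm
    or revise the initial verdict.
    """
    initial = "A" if logprobs["A"] >= logprobs["B"] else "B"
    if verdict_margin(logprobs) >= threshold:
        return initial, False
    return reflect(initial), True


# A confident instance: the cheap single-token verdict is kept.
confident = {"A": math.log(0.9), "B": math.log(0.1)}
# An uncertain instance: the verdict is routed to reflection.
uncertain = {"A": math.log(0.55), "B": math.log(0.45)}


def flip(verdict: str) -> str:
    """Stand-in for a reflection pass that revises the initial verdict."""
    return "B" if verdict == "A" else "A"


confident_result = gated_judge(confident, flip)
uncertain_result = gated_judge(uncertain, flip)
```

In this sketch only the low-margin instance pays for the longer reflection pass, which is the accuracy-efficiency trade-off the abstract describes.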
