2603.08035v1 Mar 09, 2026 cs.AI

CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling

Authors:
Y. Ban (citations: 95; h-index: 6)
Guojun Yin (citations: 202; h-index: 7)
Wei Lin (citations: 195; h-index: 7)
Xiaohan Wang (citations: 64; h-index: 4)
Jiajun Chai (citations: 79; h-index: 4)
Dengcan Liu (citations: 5; h-index: 2)
Feng Yang (citations: 259; h-index: 5)
Shurui Yan (citations: 35; h-index: 3)
Jiahao Li (citations: 12; h-index: 2)
Zhendong Mao (citations: 193; h-index: 6)

Abstract

Reward modeling is essential for aligning Large Language Models (LLMs) with human preferences, yet conventional reward models suffer from poor interpretability and heavy reliance on costly expert annotations. While recent rubric-based approaches enhance evaluation transparency, they lack systematic quality control, yielding noisy and redundant criteria, failing to mitigate persistent biases (e.g., verbosity, position) in LLM evaluators, and creating a scalability-reliability trade-off. To address these limitations, we propose CDRRM (Contrast-Driven Rubric Reward Model), a framework built on a novel Contrast-then-Synthesis paradigm for high-quality rubric generation and guided preference judgment. CDRRM first conducts multi-dimensional contrastive profiling on preference pairs to identify causal discriminative factors, then synthesizes these insights into compact, context-aware rubrics to guide preference judgments. Extensive experiments on three authoritative benchmarks (RewardBench, RMBench, RMB) demonstrate that CDRRM achieves state-of-the-art performance across diverse domains and effectively mitigates the aforementioned evaluation biases. Notably, our approach delivers exceptional data efficiency: training the rubric generator on only 3k high-quality samples empowers a frozen pre-trained judge model to outperform fully fine-tuned baselines. This work offers a scalable, interpretable, and data-efficient path for reward modeling.
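
The contrast-then-synthesize flow described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `PreferencePair`, `build_rubric`, `judge_both_orders`, and the three `toy_*` functions are hypothetical, and the toy functions are simple keyword stand-ins for what would be LLM calls in practice. The order-swap consistency check is a generic diagnostic for position bias added here for illustration; the abstract does not say that CDRRM uses it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# Hypothetical interfaces; in a real system each would wrap a model call.
Profiler = Callable[[PreferencePair], List[str]]     # pair -> discriminative factors
Synthesizer = Callable[[str, List[str]], List[str]]  # prompt, factors -> rubric
Judge = Callable[[str, List[str], str, str], str]    # prompt, rubric, A, B -> "A" or "B"


def build_rubric(pair: PreferencePair, profile: Profiler,
                 synthesize: Synthesizer) -> List[str]:
    """Contrast first (find what separates chosen from rejected), then synthesize."""
    factors = profile(pair)
    return synthesize(pair.prompt, factors)


def judge_both_orders(prompt: str, rubric: List[str], resp_1: str, resp_2: str,
                      judge: Judge) -> int:
    """Return 1 or 2 if the judge prefers the same response in both orders, else 0."""
    first = judge(prompt, rubric, resp_1, resp_2)   # resp_1 shown in slot A
    second = judge(prompt, rubric, resp_2, resp_1)  # resp_1 shown in slot B
    if first == "A" and second == "B":
        return 1
    if first == "B" and second == "A":
        return 2
    return 0  # verdict changed with position: treat as unreliable


def toy_profile(pair: PreferencePair) -> List[str]:
    # Stand-in for contrastive profiling: words in chosen that rejected lacks.
    rejected_words = set(pair.rejected.lower().split())
    return [w for w in pair.chosen.lower().split() if w not in rejected_words]


def toy_synthesize(prompt: str, factors: List[str]) -> List[str]:
    # Stand-in for synthesis: drop duplicates, keep order, cap the rubric at 3 items.
    unique: List[str] = []
    for factor in factors:
        if factor not in unique:
            unique.append(factor)
    return unique[:3]


def toy_judge(prompt: str, rubric: List[str], a: str, b: str) -> str:
    # Stand-in for a frozen judge: count rubric keywords; ties go to slot A,
    # which deliberately mimics a position-biased evaluator.
    score_a = sum(k in a.lower().split() for k in rubric)
    score_b = sum(k in b.lower().split() for k in rubric)
    return "A" if score_a >= score_b else "B"
```

With the pair ("What is 2+2?", "The answer is 4", "The answer is 5"), the toy profiler isolates the single factor "4", and the judge prefers the chosen response regardless of presentation order. Two responses that both contain "4" tie, so the toy judge picks slot A both times and the check returns 0.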
