2601.22588v1 Jan 30, 2026 cs.CL

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Ming Li (Citations: 20, h-index: 3)
Zhuochun Li (Citations: 113, h-index: 6)
Yong Zhang (Citations: 515, h-index: 5)
Yuelyu Ji (Citations: 96, h-index: 6)
Yiming Zeng (Citations: 88, h-index: 7)
Ning Cheng (Citations: 17, h-index: 2)
Yun Zhu (Citations: 6, h-index: 1)
Yanmeng Wang (Citations: 7, h-index: 1)
Shaojun Wang (Citations: 27, h-index: 3)
Jing Xiao (Citations: 41, h-index: 3)
Daqing He (Citations: 110, h-index: 6)

Original Abstract

Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
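
To make the probing idea concrete, here is a minimal sketch of a linear probe that maps a fixed-size representation to a single aspect score. It is not the INSPECTOR implementation, whose details the abstract does not give. The vectors are synthetic stand-ins for pooled hidden states of a small LM, the target stands in for one aspect score from a stronger judge, and every name (`make_synthetic_states`, `train_linear_probe`, the dimensions and learning rate) is an assumption made for illustration.

```python
import random


def make_synthetic_states(n, dim, direction, rng):
    """Hypothetical stand-in for (pooled hidden state, judge score) pairs.

    The score is a noisy linear function of the state, which mimics the
    assumption that an evaluative signal is linearly decodable.
    """
    states, scores = [], []
    for _ in range(n):
        h = [rng.gauss(0, 1) for _ in range(dim)]
        score = sum(d * x for d, x in zip(direction, h)) + rng.gauss(0, 0.1)
        states.append(h)
        scores.append(score)
    return states, scores


def predict(w, b, h):
    """Probe output: a dot product plus a bias, with no text decoding."""
    return sum(wi * xi for wi, xi in zip(w, h)) + b


def train_linear_probe(states, scores, lr=0.01, epochs=50):
    """Fit the probe with plain stochastic gradient descent on squared error."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for h, y in zip(states, scores):
            err = predict(w, b, h) - y
            for i in range(dim):
                w[i] -= lr * err * h[i]
            b -= lr * err
    return w, b


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


rng = random.Random(0)
DIM = 16  # assumed representation size, far smaller than a real hidden state
direction = [rng.gauss(0, 1) for _ in range(DIM)]  # latent "quality" direction

train_states, train_scores = make_synthetic_states(400, DIM, direction, rng)
test_states, test_scores = make_synthetic_states(100, DIM, direction, rng)

w, b = train_linear_probe(train_states, train_scores)

# Compare the probe with a trivial baseline that always predicts the
# mean training score.
probe_mse = mean_squared_error(
    [predict(w, b, h) for h in test_states], test_scores
)
train_mean = sum(train_scores) / len(train_scores)
baseline_mse = mean_squared_error([train_mean] * len(test_scores), test_scores)

print(f"probe MSE: {probe_mse:.4f}  baseline MSE: {baseline_mse:.4f}")
```

In a real setting the synthetic vectors would be replaced by representations extracted from a small model for each (question, response) pair, and one probe would be trained per evaluation aspect. The sketch only shows why scoring this way needs no decoding: inference is a single dot product on a cached representation.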

Citations: 0, Influential: 0, Altmetric: 3.5, Score: 17.5
