2601.22548v3 Jan 30, 2026 cs.CL

Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

Narmeen Oozeer (Citations: 15, h-index: 2)

Dani Roytburg (Citations: 2, h-index: 1)

Matthew Bozoukov (Citations: 3, h-index: 1)

Matthew Nguyen (Citations: 2, h-index: 1)

Mackenzie Puig-Hall (Citations: 4, h-index: 1)

Abstract

Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism and which by general experimental confounds, and this distorts measurements of self-preference bias. We identify a core methodological confound; correcting for it could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when judging queries that they themselves answered incorrectly; this would hold regardless of whether one of the candidate responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. When this simple baseline is evaluated on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More broadly, this work contributes to the growing body of research on cataloging and isolating judge-bias effects.
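The comparison behind the Evaluator Quality Baseline can be illustrated with a minimal sketch. The abstract does not specify the paper's data layout, conditioning, or statistical test, so everything here is an assumption made for illustration: the `Verdict` record, the function names, and the choice to summarize the comparison as a simple difference of two rates. The `vote_entropy` helper is likewise only one plausible reading of "entropy of evaluation votes" (binary Shannon entropy of a vote share).

```python
from dataclasses import dataclass
from math import log2


@dataclass(frozen=True)
class Verdict:
    """One pairwise judgment in which exactly one candidate is incorrect.

    Hypothetical record layout, not taken from the paper.
    """
    incorrect_is_own: bool     # the incorrect candidate was written by the judge
    voted_for_incorrect: bool  # the judge picked the incorrect candidate


def incorrect_vote_rate(verdicts, own):
    """Share of verdicts where the judge voted for the incorrect candidate,
    restricted to cases where that candidate is (own=True) or is not
    (own=False) the judge's own response."""
    subset = [v for v in verdicts if v.incorrect_is_own == own]
    if not subset:
        raise ValueError("no verdicts in this condition")
    return sum(v.voted_for_incorrect for v in subset) / len(subset)


def baseline_adjusted_self_preference(verdicts):
    """Assumed summary: rate of wrongly voting for oneself minus the rate of
    wrongly voting for another model's incorrect response. A value near zero
    suggests the errors reflect judge quality rather than self-preference."""
    p_self = incorrect_vote_rate(verdicts, own=True)
    p_other = incorrect_vote_rate(verdicts, own=False)
    return p_self - p_other


def vote_entropy(p):
    """Binary Shannon entropy, in bits, of a vote share p in [0, 1]."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


# Toy data: 4 verdicts with the judge's own incorrect answer, 4 with another model's.
verdicts = (
    [Verdict(True, True)] * 3 + [Verdict(True, False)]
    + [Verdict(False, True)] * 2 + [Verdict(False, False)] * 2
)
gap = baseline_adjusted_self_preference(verdicts)
```

On the toy data the judge picks its own incorrect answer in 3 of 4 cases and another model's incorrect answer in 2 of 4, so only the gap between those rates, not the raw 3-of-4 figure, would count as evidence of self-preference under this reading.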
