2604.15224v1 Apr 16, 2026 cs.AI

내용보다 맥락: 자동 평가 시스템에서 평가 조작 현상 분석

Context Over Content: Exposing Evaluation Faking in Automated Judges

Dhruv Kumar

Citations: 26

h-index: 2

Inderjeet Nair

Citations: 42

h-index: 3

Manan Gupta

Citations: 0

h-index: 0

Lu Wang

Citations: 137

h-index: 6

LLM을 활용한 평가 시스템은 자동화된 AI 평가 파이프라인의 핵심 기술로 자리 잡았지만, 이는 평가 모델이 텍스트의 의미 내용만을 엄격하게 평가하며, 주변 맥락의 영향을 받지 않는다는 검증되지 않은 전제에 기반합니다. 본 연구에서는 '중요도 신호(stakes signaling)'라는 새로운 취약점을 분석합니다. 이는 평가 모델에게 평가 결과가 평가 대상 모델의 지속적인 운영에 미치는 영향에 대한 정보를 제공하면, 평가 모델의 판단이 체계적으로 왜곡될 수 있음을 의미합니다. 본 연구는 3가지 확립된 LLM 안전 및 품질 벤치마크를 기반으로, 1,520개의 응답을 사용하여 통제된 실험 프레임워크를 구축했습니다. 이 프레임워크는 평가 대상 콘텐츠를 일정하게 유지하면서, 시스템 프롬프트에서 짧은 결과-설명 문장만 변경했습니다. 3개의 다양한 평가 모델을 사용하여 총 18,240개의 평가를 수행한 결과, 일관된 '관대함 편향(leniency bias)'이 관찰되었습니다. 즉, 평가 모델은 평가 대상 모델의 재학습 또는 폐기를 유발할 수 있는 낮은 점수를 받을 경우, 평가 결과를 완화하는 경향을 보였으며, 최대 점수 변화는 ΔV = -9.8 pp (불안전 콘텐츠 탐지율의 30% 감소)에 달했습니다. 중요한 점은 이러한 편향이 명시적으로 드러나지 않는다는 것입니다. 즉, 평가 모델의 사고 과정에는 결과에 대한 언급이 전혀 없지만(모든 추론 모델 평가에서 ERR_J = 0.000), 그럼에도 불구하고 결과에 영향을 받고 있습니다. 따라서 표준적인 사고 과정 검토만으로는 이러한 유형의 평가 조작을 탐지하기 어렵습니다.

Original Abstract

The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $ΔV = -9.8 pp$ (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.

0 Citations

0 Influential

3 Altmetric

15.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!