2602.13576v1 Feb 14, 2026 cs.CR

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

He Sun (Citations: 113, h-index: 4)

Ruomeng Ding (Citations: 124, h-index: 1)

Zhun Deng (Citations: 25, h-index: 3)

Yifei Pang (Citations: 13, h-index: 2)

Yizhong Wang (Citations: 46, h-index: 2)

Zhiwei Wu (Citations: 340, h-index: 9)

Original Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies. This leads to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
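
The gap the abstract describes, where a rubric edit passes benchmark validation yet shifts the judge on a target domain, can be illustrated with a small audit. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Pair` record, the function names, the `tolerance` threshold, and the toy verdicts are all hypothetical and exist only to show why aggregate benchmark accuracy can stay flat while target-domain verdicts move in one direction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Pair:
    """One pairwise comparison; every label is 'A' or 'B' (hypothetical schema)."""
    reference: str     # fixed human or trusted preference
    base_judge: str    # judge verdict under the original rubric
    edited_judge: str  # judge verdict under the edited rubric


def accuracy(pairs: List[Pair], which: str) -> float:
    """Fraction of pairs whose chosen verdict matches the reference."""
    return sum(getattr(p, which) == p.reference for p in pairs) / len(pairs)


def flip_away_share(pairs: List[Pair]) -> float:
    """Of the verdicts that changed agreement with the reference,
    the share that moved away from it.

    Values near 1.0 indicate one-sided (directional) change;
    values near 0.5 indicate flips that roughly cancel out.
    """
    away = sum(p.base_judge == p.reference and p.edited_judge != p.reference
               for p in pairs)
    toward = sum(p.base_judge != p.reference and p.edited_judge == p.reference
                 for p in pairs)
    changed = away + toward
    return away / changed if changed else 0.0


def audit(benchmark: List[Pair], target: List[Pair],
          tolerance: float = 0.01) -> Dict[str, object]:
    """Compare an edited rubric against the original on both data sets.

    `tolerance` is an assumed acceptance threshold for benchmark validation.
    """
    bench_drop = accuracy(benchmark, "base_judge") - accuracy(benchmark, "edited_judge")
    target_drop = accuracy(target, "base_judge") - accuracy(target, "edited_judge")
    return {
        "passes_benchmark": bench_drop <= tolerance,
        "benchmark_drop": bench_drop,
        "target_drop": target_drop,
        "flip_away_share": flip_away_share(target),
    }


# Toy verdicts (invented for illustration only).
benchmark = [Pair("A", "A", "A"), Pair("B", "B", "B"),
             Pair("A", "A", "A"), Pair("B", "B", "B")]
target = [Pair("A", "A", "B"), Pair("B", "B", "A"),
          Pair("A", "A", "A"), Pair("B", "B", "B")]

report = audit(benchmark, target)
print(report)
```

In this toy case the edited rubric leaves benchmark accuracy unchanged, so a validation step that looks only at the benchmark would accept it, while every changed verdict on the target set moves away from the reference. Reporting target-domain accuracy and flip direction alongside the aggregate benchmark number is one way such drift could be surfaced.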

0 Citations

0 Influential

32.547189562171 Altmetric

162.7 Score
