2601.08654v1 Jan 13, 2026 cs.CL

RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation

Wanpeng Xu

Citations: 5

h-index: 2

Hua Wei

Citations: 6

h-index: 2

Yi-Ting Hong

Citations: 4

h-index: 1

Huaiyuan Yao

Citations: 43

h-index: 4

Bolin Shen

Citations: 50

h-index: 3

Yushun Dong

Citations: 114

h-index: 7

Abstract

The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
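The three mechanisms named in the abstract (locked rubric bundles, deterministic evidence verification, and Wasserstein-based calibration) can be illustrated with a small sketch. This is not the authors' implementation, which lives in the linked repository. Every name below (`RubricBundle`, `compile_rubric`, `verify_evidence`, `quantile_map`) is hypothetical, and the choices of a SHA-256 content hash as the version, verbatim substring matching as the evidence check, and empirical quantile mapping as the calibration step are assumptions made for illustration. The last choice relies on the fact that, in one dimension, the transport map minimizing Wasserstein distance is the monotone quantile map; the paper's actual calibration procedure may differ.

```python
import hashlib
import json
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricBundle:
    """Hypothetical locked rubric: criteria plus a content-derived version."""
    criteria: tuple  # sorted (name, description) pairs, immutable
    version: str     # short hash of the criteria (an assumed scheme)


def compile_rubric(criteria: dict) -> RubricBundle:
    # Sort the items so the same rubric always serializes, and hashes,
    # identically regardless of the order in which criteria were given.
    items = tuple(sorted(criteria.items()))
    payload = json.dumps(items, ensure_ascii=False).encode("utf-8")
    return RubricBundle(criteria=items,
                        version=hashlib.sha256(payload).hexdigest()[:12])


def verify_evidence(source_text: str, quotes: list) -> list:
    # Deterministic check (assumed form): a quote counts as evidence only
    # if it appears verbatim in the judged text, ignoring whitespace runs.
    normalized = " ".join(source_text.split())
    return [" ".join(q.split()) in normalized for q in quotes]


def quantile_map(raw_scores: list, human_scores: list):
    # Build a monotone map from the judge's score distribution onto the
    # human score distribution using held-out samples of each.
    raw_sorted = sorted(raw_scores)
    human_sorted = sorted(human_scores)

    def calibrate(score: float) -> float:
        # Empirical CDF position of the raw score among judge scores.
        q = bisect_right(raw_sorted, score) / len(raw_sorted)
        # Human score at the same quantile, clamped to valid indices.
        idx = min(len(human_sorted) - 1,
                  max(0, int(q * len(human_sorted)) - 1))
        return human_sorted[idx]

    return calibrate
```

In this sketch, any edit to the rubric wording yields a different `version`, so scores produced under different rubric texts cannot be silently mixed. A judge output whose quotes fail `verify_evidence` could be rejected or re-queried before scoring, and `quantile_map` adjusts only the final scores, leaving the judge model's parameters untouched.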

3 Citations

0 Influential

23.5 Altmetric

120.5 Score
