2603.25133v1 Mar 26, 2026 cs.AI

RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

Qi He (Citations: 591, h-index: 10)
Xuan Lin (Citations: 33, h-index: 3)
Tian Pan (Citations: 21, h-index: 2)
Wenyan Yang (Citations: 42, h-index: 4)
Shisong Chen (Citations: 15, h-index: 2)
Li Qi (Citations: 18, h-index: 2)
Wanqing Xu (Citations: 26, h-index: 3)
Hongwei Feng (Citations: 608, h-index: 12)
Bo Xu (Citations: 29, h-index: 3)
Yanghua Xiao (Citations: 47, h-index: 4)

Abstract

Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted judge in instruction-following benchmarks, achieves only 55.97% on the Hard subset. Regarding the evaluation paradigm, rubric-level evaluation outperforms checklist-level evaluation, explicit reasoning improves accuracy, and both together reduce inter-judge variance. Through our established rubric taxonomy, we further identify common failure modes and offer actionable insights for reliable instruction-following evaluation.
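The abstract reports judge accuracy and inter-judge variance without spelling out how either is computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each instance carries a binary met/not-met gold label for a single rubric, that accuracy is the fraction of judge verdicts matching the gold label, and that inter-judge variance is the population variance of per-judge accuracies. The function names and toy data are hypothetical.

```python
from statistics import pvariance


def rubric_accuracy(judge_labels, gold_labels):
    """Fraction of rubric-level verdicts that match the gold labels.

    Assumption: one boolean verdict per (instruction, response, rubric) instance.
    """
    if len(judge_labels) != len(gold_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("no instances to score")
    matches = sum(j == g for j, g in zip(judge_labels, gold_labels))
    return matches / len(gold_labels)


def inter_judge_variance(accuracies):
    """Population variance of accuracy across judges (assumed definition)."""
    return pvariance(accuracies)


# Toy data, invented for illustration only.
gold = [True, False, True, True, False]
judges = {
    "judge_a": [True, False, False, True, False],
    "judge_b": [True, True, False, True, True],
}

scores = {name: rubric_accuracy(labels, gold) for name, labels in judges.items()}
spread = inter_judge_variance(list(scores.values()))
```

Scoring at the rubric level, rather than collapsing each response to a single verdict, is what lets a benchmark of this kind localize which rubrics a judge gets wrong.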

Citations: 8, Influential: 0, Altmetric: 6, Score: 38.0
