2605.03858v1 May 05, 2026 cs.CL

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

Junyoung Koh

Yonsei University

Z. Tok

Hunar Batra

Ronald Clark

Jaeyun Lee

Abstract

Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for the rarer "partial" and "no" cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
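To make the instance structure and the two metric families more concrete, the following is a minimal Python sketch. It is not the benchmark's code, and the abstract does not give the paper's metric definitions: the `Instance` fields, the per-label recall used as a correctness measure, and the "any disagreement across runs" rate used as an inconsistency measure are all illustrative assumptions, and the example data is invented.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("yes", "partial", "no")


@dataclass
class Instance:
    """Hypothetical layout of one benchmark item (field names are assumptions)."""
    instruction: str
    response: str
    constraints: list[str]
    gold: list[str]  # one label from LABELS per constraint


def per_label_recall(gold, pred):
    """For each gold label, the fraction of constraints the judge labelled correctly."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {lab: hits[lab] / totals[lab] for lab in LABELS if totals[lab]}


def disagreement_rate(runs):
    """Share of constraints whose predicted label differs across the given runs."""
    per_constraint = list(zip(*runs))  # regroup: one tuple of labels per constraint
    unstable = sum(1 for labels in per_constraint if len(set(labels)) > 1)
    return unstable / len(per_constraint)


# Invented example: four constraints, three judge runs on the same item.
gold = ["yes", "yes", "partial", "no"]
runs = [
    ["yes", "yes", "yes", "no"],
    ["yes", "yes", "partial", "no"],
    ["yes", "no", "yes", "no"],
]

print(per_label_recall(gold, runs[0]))
print(disagreement_rate(runs))
```

Under these assumptions, passing repeated samples from one fixed prompt to `disagreement_rate` would approximate intrinsic inconsistency, while passing runs that differ in prompt variant or response perturbation would approximate procedural inconsistency. Reporting recall per label rather than a single accuracy figure keeps weak detection of the rarer "partial" and "no" labels visible.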
