2602.14307v1 Feb 15, 2026 cs.AI

Benchmarking at the Edge of Comprehension

Samuele Marro (Citations: 116; h-index: 5)

Jialin Yu (Citations: 6; h-index: 2)

Emanuele La Malfa (University of Oxford; Citations: 520; h-index: 12)

Oishi Deb (Citations: 48; h-index: 3)

Jiawei Li (Citations: 29; h-index: 3)

Yibo Yang (Citations: 12; h-index: 3)

Ebey Abraham (Citations: 46; h-index: 1)

Sunando Sengupta (Citations: 993; h-index: 9)

Eric Sommerlade (Citations: 879; h-index: 13)

Michael Wooldridge (Citations: 138; h-index: 6)

Philip Torr (Citations: 15; h-index: 2)

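As frontier large language models (LLMs) saturate new benchmarks almost as soon as they are released, the field of benchmarking stands at a critical juncture. If frontier models keep improving, it will become increasingly difficult for humans to create discriminative tasks, supply accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes impossible, our ability to measure progress in AI is also at risk. We call this situation the "post-comprehension regime". In this work, we propose "Critique-Resilient Benchmarking", an adversarial framework designed so that models can be compared even when full human understanding is not possible. The technique builds on the notion of "critique-resilient correctness", under which an answer is treated as correct if no adversary has convincingly refuted it. Unlike conventional benchmarking, humans act as bounded verifiers who focus only on localized claims, which preserves the integrity of the evaluation even without full comprehension of the whole task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve hard problems and their ability to generate difficult yet solvable problems. We validate the method in the mathematics domain on eight frontier LLMs and find that the resulting scores are stable and correlate with external capability measures. Our framework recasts benchmarking as an adversarial generation-evaluation game in which humans take part as final adjudicators.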

Original Abstract

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.
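
The abstract does not spell out the itemized bipartite Bradley-Terry model, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' method. It assumes each record is a hypothetical triple `(solver, generator, solved)`, where `solved = 1` means the answer survived every critique, and it assumes the simplest Bradley-Terry form, P(solved) = sigmoid(ability[solver] - difficulty[generator]). The function name `fit_bipartite_bt`, the L2 penalty, the model names, and the toy data are all invented for illustration.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_bipartite_bt(records, n_iters=2000, lr=0.1, l2=0.01):
    """Fit P(solved) = sigmoid(ability[solver] - difficulty[generator]).

    Assumption: this is a simplified stand-in for the paper's model.
    Uses plain gradient ascent on the L2-penalised log-likelihood.
    """
    ability = defaultdict(float)      # solver strength, starts at 0
    difficulty = defaultdict(float)   # hardness of a generator's questions
    for _ in range(n_iters):
        grad_a = defaultdict(float)
        grad_d = defaultdict(float)
        for solver, generator, solved in records:
            p = sigmoid(ability[solver] - difficulty[generator])
            grad_a[solver] += solved - p       # d(log-lik) / d(ability)
            grad_d[generator] -= solved - p    # d(log-lik) / d(difficulty)
        for s, g in grad_a.items():
            ability[s] += lr * (g - l2 * ability[s])
        for q, g in grad_d.items():
            difficulty[q] += lr * (g - l2 * difficulty[q])
    return dict(ability), dict(difficulty)


def tally(solver, generator, n_solved, n_failed):
    """Expand win/loss counts into individual records."""
    return [(solver, generator, 1)] * n_solved + [(solver, generator, 0)] * n_failed


# Invented toy data: models A and B answer questions written by C and D,
# and C and D answer questions written by A and B.
records = (
    tally("A", "C", 3, 1) + tally("A", "D", 4, 0)
    + tally("B", "C", 1, 3) + tally("B", "D", 3, 1)
    + tally("C", "A", 1, 3) + tally("C", "B", 3, 1)
    + tally("D", "A", 0, 4) + tally("D", "B", 2, 2)
)

ability, difficulty = fit_bipartite_bt(records)
print("solver ranking:   ", sorted(ability, key=ability.get, reverse=True))
print("generator ranking:", sorted(difficulty, key=difficulty.get, reverse=True))
```

Each model ends up with two scores, one per role, which mirrors the joint ranking of solving and question-generating ability described in the abstract. The L2 penalty is a design choice of this sketch: it keeps the estimates finite when a model solves all or none of another model's questions, as happens for A and D in the toy data.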

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
