2602.14307v2 Feb 15, 2026 cs.AI

Benchmarking at the Edge of Comprehension

Samuele Marro (Citations: 116, h-index: 5)

Jialin Yu (Citations: 6, h-index: 2)

Emanuele La Malfa, University of Oxford (Citations: 520, h-index: 12)

Oishi Deb (Citations: 48, h-index: 3)

Jiawei Li (Citations: 29, h-index: 3)

Yibo Yang (Citations: 12, h-index: 3)

Ebey Abraham (Citations: 46, h-index: 1)

Sunando Sengupta (Citations: 993, h-index: 9)

Eric Sommerlade (Citations: 879, h-index: 13)

Michael Wooldridge (Citations: 138, h-index: 6)

Philip Torr (Citations: 15, h-index: 2)

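As frontier Large Language Models (LLMs) saturate new benchmarks almost as soon as they are published, benchmarking itself is at a critical juncture. If frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our very ability to measure progress in AI is at stake. We call this scenario the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique is based on the notion of critique-resilient correctness: an answer is deemed correct if no adversary can convincingly refute it. Unlike standard benchmarking, humans act as bounded verifiers who focus on specific claims rather than fully comprehending the whole task, which preserves the integrity of the evaluation. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve hard tasks and to generate questions that are difficult yet solvable. We demonstrate the method in the mathematical domain on eight frontier LLMs and find that the resulting scores are stable and correlate with external capability measures. Our framework recasts benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.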

Original Abstract

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.
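The abstract does not spell out the itemized bipartite Bradley-Terry model, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' formulation. It assumes each solver has a latent ability, each question author has a latent difficulty, and the probability that a solver answers an author's question correctly is a logistic function of the difference. The function name `fit_bipartite_bt`, the outcome format, and the L2 penalty are all hypothetical choices made for illustration; the per-item ("itemized") structure and the critique and solvability checks described in the abstract are not modeled here.

```python
import math


def fit_bipartite_bt(outcomes, n_iters=2000, lr=0.1, l2=0.01):
    """Fit a simple bipartite Bradley-Terry-style model by gradient ascent.

    outcomes: list of (solver, questioner, solved) tuples, solved in {0, 1}.
    Assumed model (illustrative only):
        P(solved) = sigmoid(ability[solver] - difficulty[questioner])
    A small L2 penalty keeps the parameters finite and identifiable.
    """
    solvers = sorted({s for s, _, _ in outcomes})
    questioners = sorted({q for _, q, _ in outcomes})
    ability = {s: 0.0 for s in solvers}
    difficulty = {q: 0.0 for q in questioners}

    for _ in range(n_iters):
        # Start each gradient with the contribution of the L2 penalty.
        grad_a = {s: -l2 * ability[s] for s in solvers}
        grad_d = {q: -l2 * difficulty[q] for q in questioners}
        for s, q, solved in outcomes:
            p = 1.0 / (1.0 + math.exp(-(ability[s] - difficulty[q])))
            # Gradient of the Bernoulli log-likelihood with respect to
            # ability is (solved - p); for difficulty it has opposite sign.
            grad_a[s] += solved - p
            grad_d[q] -= solved - p
        for s in solvers:
            ability[s] += lr * grad_a[s]
        for q in questioners:
            difficulty[q] += lr * grad_d[q]
    return ability, difficulty


# Toy data (made up): model "A" solves more questions than "B",
# and questions written by "Y" are solved less often than those by "X".
toy_outcomes = [
    ("A", "X", 1), ("A", "X", 1), ("A", "Y", 1), ("A", "Y", 0),
    ("B", "X", 1), ("B", "X", 0), ("B", "Y", 0), ("B", "Y", 0),
]
ability, difficulty = fit_bipartite_bt(toy_outcomes)
solver_ranking = sorted(ability, key=ability.get, reverse=True)
questioner_ranking = sorted(difficulty, key=difficulty.get, reverse=True)
```

On this toy data the fit ranks "A" above "B" as a solver and "Y" above "X" as a question author, which illustrates how a single set of outcomes can yield two rankings at once. A fuller treatment would presumably need to count only questions that survive critique and are shown to be solvable, as the abstract requires.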

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
