2602.16763v1 Feb 18, 2026 cs.AI

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Leshem Choshen

Citations: 892

h-index: 14

Mrinmaya Sachan

Citations: 5,569

h-index: 41

M. Kochenderfer

Citations: 2,039

h-index: 24

S. Pawar

Citations: 141

h-index: 4

Mubashara Akhtar

Citations: 285

h-index: 9

Anka Reuel

Citations: 1,106

h-index: 14

Prajna Soni

Citations: 17

h-index: 2

Sanchit Ahuja

Citations: 206

h-index: 6

Pawan Sasanka Ammanamanchi

Citations: 3,559

h-index: 8

Ruchit Rawal

Citations: 189

h-index: 5

Vilém Zouhar

Citations: 82

h-index: 5

Chenxi Whitehouse

Citations: 717

h-index: 12

Dayeon Ki

Citations: 507

h-index: 5

Jennifer Mickel

Citations: 194

h-index: 4

Marek Šuppa

Citations: 150

h-index: 2

Jan Batzner

Citations: 37

h-index: 4

Jenny Chim

Citations: 5

h-index: 2

Jeba Sania

Citations: 7

h-index: 2

Yanan Long

Citations: 92

h-index: 3

Hossein A. Rahmani

Citations: 5

h-index: 2

Christina Q. Knight

Citations: 29

h-index: 4

Yiyang Nan

Citations: 188

h-index: 7

J. Raj

Citations: 12

h-index: 2

Yu Fan

Citations: 37

h-index: 3

Shubham Singh

Citations: 24

h-index: 2

Subramanyam Sahoo

Citations: 7

h-index: 2

Eliya Habba

Citations: 48

h-index: 4

Usman Gohar

Iowa State University

Citations: 400

h-index: 7

Robert Scholz

Citations: 21

h-index: 2

Arjun Subramonian

Meta FAIR

Citations: 4,061

h-index: 15

Jingwei Ni

ETH Zürich

Citations: 438

h-index: 11

Sanmi Koyejo

Citations: 3,864

h-index: 23

Stella Biderman

Citations: 167

h-index: 3

Z. Talat

Citations: 2

h-index: 1

Irene Solaiman

Citations: 4,465

h-index: 9

Srishti Yadav

Citations: 141

h-index: 4

Avijit Ghosh

Citations: 115

h-index: 5

Abstract

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.

2 Citations

0 Influential

20.5 Altmetric

104.5 Score
