2606.09809v1 Jun 08, 2026 cs.AI

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Leshem Choshen

Citations: 1,060

h-index: 14

Avijit Ghosh

Citations: 44

h-index: 4

M. Kochenderfer

Citations: 2,206

h-index: 24

David Manheim

Citations: 59

h-index: 3

Anka Reuel

Citations: 1,295

h-index: 14

Jennifer Mickel

Citations: 218

h-index: 5

Jan Batzner

Citations: 156

h-index: 7

Jenny Chim

Citations: 13

h-index: 2

Jeba Sania

Citations: 15

h-index: 2

Yanan Long

Citations: 104

h-index: 4

Eliya Habba

Citations: 61

h-index: 5

Usman Gohar

Iowa State University

Citations: 445

h-index: 8

Sanmi Koyejo

Citations: 4,394

h-index: 25

Stella Biderman

Citations: 217

h-index: 4

Irene Solaiman

Citations: 4,597

h-index: 9

Asaf Yehudai

Citations: 407

h-index: 9

Srishti Yadav

Citations: 173

h-index: 5

Michael Hardy

Citations: 110

h-index: 5

Max Lamparth

Stanford University

Citations: 986

h-index: 12

Kevin Klyman

Citations: 883

h-index: 15

Aarush Sinha

Citations: 15

h-index: 3

N. Heath

Citations: 0

h-index: 0

Shalaleh Rismani

Citations: 866

h-index: 11

Subramanyam Sahoo

Citations: 6

h-index: 2

M. A. Riegler

Citations: 16

h-index: 2

Wm. Matthew Kennedy

Citations: 13

h-index: 1

Andrew Tran

Citations: 44

h-index: 3

A. Kornilova

Citations: 251

h-index: 5

Damian Stachura

Citations: 42

h-index: 2

F. Friedrich

Citations: 21,928

h-index: 59

Anoop Mishra

Citations: 71

h-index: 5

Yixiong Hao

Georgia Tech

Citations: 9

h-index: 2

Andreas Loehr

Citations: 1

h-index: 1

Ruchira Dhar

Citations: 34

h-index: 4

Sree Harsha Nelaturu

Citations: 79

h-index: 3

Drishti Sharma

Citations: 13

h-index: 2

I. Khire

Citations: 8

h-index: 1

Amit Saha

Citations: 1

h-index: 1

Kabir Manghnani

Citations: 129

h-index: 3

M. Lin

Citations: 77

h-index: 2

Yanan Jiang

Citations: 117

h-index: 6

Yilin Huang

Citations: 7

h-index: 1

Jessica Ji

Citations: 19

h-index: 3

A. Hofmann

Citations: 0

h-index: 0

Mubashara Akhtar

King's College London

Citations: 217

h-index: 5

Nuno Moniz

Citations: 0

h-index: 0

Yacine Jernite

Hugging Face

Citations: 11,816

h-index: 28

Zeerak Talat

Citations: 54

h-index: 1

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present Evaluation Cards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies Evaluation Cards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
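To make the idea of a unified record concrete, the following is a minimal illustrative sketch, not the authors' schema. It assumes three hypothetical metadata classes (`BenchmarkMeta`, `ModelMeta`, `EvalRun`) with invented field names, composes them into one `EvaluationRecord`, and computes a toy documentation-completeness signal as the fraction of fields that are filled in. The abstract does not define the actual signal or its fields, so everything named here is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Optional


# All class and field names below are hypothetical, chosen for illustration only.
@dataclass
class BenchmarkMeta:
    name: str
    metric: Optional[str] = None
    version: Optional[str] = None


@dataclass
class ModelMeta:
    name: str
    developer: Optional[str] = None
    license: Optional[str] = None


@dataclass
class EvalRun:
    score: float
    prompt_template: Optional[str] = None
    num_shots: Optional[int] = None
    source_url: Optional[str] = None


@dataclass
class EvaluationRecord:
    """One record composing benchmark metadata, model metadata, and run data."""
    benchmark: BenchmarkMeta
    model: ModelMeta
    run: EvalRun


def documentation_completeness(record: EvaluationRecord) -> float:
    """Toy signal: fraction of fields across the three parts that are not None."""
    parts = (record.benchmark, record.model, record.run)
    values = [getattr(part, f.name) for part in parts for f in fields(part)]
    return sum(v is not None for v in values) / len(values)


# Placeholder values, not data from the paper.
record = EvaluationRecord(
    benchmark=BenchmarkMeta(name="example-benchmark", metric="accuracy"),
    model=ModelMeta(name="example-model"),
    run=EvalRun(score=0.5, num_shots=5),
)
print(f"{documentation_completeness(record):.2f}")
```

A field-count ratio is the simplest possible stand-in for a completeness signal; a real implementation would presumably weight fields by their importance to each reader mode.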
