2606.09809v1 Jun 08, 2026 cs.AI

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Leshem Choshen
Citations: 1,060
h-index: 14
Avijit Ghosh
Citations: 44
h-index: 4
M. Kochenderfer
Citations: 2,206
h-index: 24
David Manheim
Citations: 59
h-index: 3
Anka Reuel
Citations: 1,295
h-index: 14
Jennifer Mickel
Citations: 218
h-index: 5
Jan Batzner
Citations: 156
h-index: 7
Jenny Chim
Citations: 13
h-index: 2
Jeba Sania
Citations: 15
h-index: 2
Yanan Long
Citations: 104
h-index: 4
Eliya Habba
Citations: 61
h-index: 5
Usman Gohar
Iowa State University
Citations: 445
h-index: 8
Sanmi Koyejo
Citations: 4,394
h-index: 25
Stella Biderman
Citations: 217
h-index: 4
Irene Solaiman
Citations: 4,597
h-index: 9
Asaf Yehudai
Citations: 407
h-index: 9
Srishti Yadav
Citations: 173
h-index: 5
Michael Hardy
Citations: 110
h-index: 5
Max Lamparth
Stanford University
Citations: 986
h-index: 12
Kevin Klyman
Citations: 883
h-index: 15
Aarush Sinha
Citations: 15
h-index: 3
N. Heath
Citations: 0
h-index: 0
Shalaleh Rismani
Citations: 866
h-index: 11
Subramanyam Sahoo
Citations: 6
h-index: 2
M. A. Riegler
Citations: 16
h-index: 2
Wm. Matthew Kennedy
Citations: 13
h-index: 1
Andrew Tran
Citations: 44
h-index: 3
A. Kornilova
Citations: 251
h-index: 5
Damian Stachura
Citations: 42
h-index: 2
F. Friedrich
Citations: 21,928
h-index: 59
Anoop Mishra
Citations: 71
h-index: 5
Yixiong Hao
Georgia Tech
Citations: 9
h-index: 2
Andreas Loehr
Citations: 1
h-index: 1
Ruchira Dhar
Citations: 34
h-index: 4
Sree Harsha Nelaturu
Citations: 79
h-index: 3
Drishti Sharma
Citations: 13
h-index: 2
I. Khire
Citations: 8
h-index: 1
Amit Saha
Citations: 1
h-index: 1
Kabir Manghnani
Citations: 129
h-index: 3
M. Lin
Citations: 77
h-index: 2
Yanan Jiang
Citations: 117
h-index: 6
Yilin Huang
Citations: 7
h-index: 1
Jessica Ji
Citations: 19
h-index: 3
A. Hofmann
Citations: 0
h-index: 0
Mubashara Akhtar
King's College London
Citations: 217
h-index: 5
Nuno Moniz
Citations: 0
h-index: 0
Yacine Jernite
Hugging Face
Citations: 11,816
h-index: 28
Zeerak Talat
Citations: 54
h-index: 1

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present Evaluation Cards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies Evaluation Cards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
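
As a rough illustration of the unified record described in the abstract, the sketch below composes benchmark metadata, model metadata, and evaluation run data into one object and computes a toy documentation-completeness score. Every class name, field name, and the scoring rule itself are assumptions made for illustration; the abstract does not specify the actual schema or how the four signals are computed.

```python
from dataclasses import dataclass, fields
from typing import Optional


# All classes and fields below are hypothetical; they are not the paper's schema.
@dataclass
class BenchmarkMetadata:
    name: str
    task: Optional[str] = None
    license: Optional[str] = None


@dataclass
class ModelMetadata:
    name: str
    developer: Optional[str] = None
    release_date: Optional[str] = None


@dataclass
class EvaluationRun:
    metric: str
    score: float
    source_url: Optional[str] = None
    prompt_template: Optional[str] = None


@dataclass
class EvaluationRecord:
    """One unified record composed from the three metadata sources."""
    benchmark: BenchmarkMetadata
    model: ModelMetadata
    run: EvaluationRun


def documentation_completeness(record: EvaluationRecord) -> float:
    """Toy signal (an assumption): fraction of fields that are not None."""
    parts = (record.benchmark, record.model, record.run)
    values = [getattr(part, f.name) for part in parts for f in fields(part)]
    filled = sum(value is not None for value in values)
    return filled / len(values)


# Example record with placeholder values: 5 of its 10 fields are filled in.
example = EvaluationRecord(
    benchmark=BenchmarkMetadata(name="example-benchmark", task="question answering"),
    model=ModelMetadata(name="example-model"),
    run=EvaluationRun(metric="accuracy", score=0.71),
)
print(documentation_completeness(example))
```

A real completeness signal would presumably weight fields by how much they matter to a reader rather than counting them equally; the equal-weight count here is only the simplest possible stand-in.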

