2604.13940v1 Apr 15, 2026 cs.AI

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

Junyi Li

Citations: 20

h-index: 3

Matthew E. Taylor

Citations: 13

h-index: 2

Joydeep Biswas

Citations: 121

h-index: 6

Sheila Schoepp

Citations: 16

h-index: 2

G. Vasan

Citations: 3

h-index: 1

Anthony Opipari

Citations: 50

h-index: 2

Zichao Hu

Citations: 163

h-index: 6

S. Joseph

Citations: 40

h-index: 2

Matthew Lease

Citations: 11

h-index: 2

Peter Stone

Citations: 84

h-index: 4

K. Wagstaff

Citations: 5,638

h-index: 35

O. C. Jenkins

Citations: 4,853

h-index: 37

Arthur Zhang

Citations: 138

h-index: 4

Original Abstract

Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.

2 Citations

1 Influential

18.5 Altmetric

96.5 Score
