2602.18458v1 Feb 05, 2026 cs.CY

The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research

Xiaoyan Bai (Citations: 10, h-index: 1)

Alex Baumgartner (Citations: 64, h-index: 4)

Haojia Sun (Citations: 29, h-index: 2)

Ari Holtzman (Citations: 3, h-index: 1)

Chenhao Tan (Citations: 119, h-index: 6)

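Reproducibility crises across scientific fields expose the limits of the paper-centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate research outputs make these challenges worse. In this work, we use AI agents as research evaluators to improve both scalability and rigor. We propose the first execution-grounded evaluation framework, which verifies research by analyzing code and data alongside the paper rather than reviewing the paper alone. Using mechanistic interpretability research as a testbed, we build standardized research outputs and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. Our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. This work shows that AI agents can transform research evaluation and pave the way for rigorous scientific practice.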

Original Abstract

Reproducibility crises across sciences highlight the limitations of the paper-centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate large volumes of research outputs exacerbate these challenges. In this work, we address the growing challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We propose the first execution-grounded evaluation framework that verifies research beyond narrative review by examining code and data alongside the paper. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. We show that our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.

Citations: 1, Influential: 0, Altmetric: 3, Score: 16.0
