2604.03338v1 Apr 03, 2026 econ.GN

The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research

Abstract

Autonomous AI systems can now generate complete economics research papers, but they substantially underperform human-authored publications in head-to-head comparisons. This paper decomposes the quality gap into two independent components: research idea quality and execution quality. We evaluate idea quality with a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026), and execution quality with a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite (the same model family used as the APE tournament judge, ensuring methodological consistency). We analyze 953 economics papers: 912 AI-generated papers from the APE project and 41 human papers published in the American Economic Review and AEJ: Economic Policy. The idea quality gap is large (Cohen's d = 2.23, p < 0.001), with human papers achieving a mean ensemble exceptional probability of 47.1% versus 16.5% for AI papers. The execution quality gap is also significant but smaller (d = 0.90, p < 0.001), with human papers scoring 4.38/5.0 versus 3.84/5.0 for AI papers. Idea quality accounts for approximately 71% of the overall quality difference, with execution contributing 29%. The largest execution weakness is mechanism analysis depth (d = 1.43); no significant difference is found on robustness. We document that 74% of AI papers employ difference-in-differences, and only 7 AI papers (0.8%) surpass the median human paper on both idea and execution quality simultaneously. The primary bottleneck to competitive AI-generated economics research remains ideation.
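
The headline figures can be sanity-checked with a few lines of arithmetic. The abstract does not say how the 71%/29% split is computed; the sketch below assumes, purely for illustration, that each component's share is its Cohen's d divided by the sum of the two d values, which reproduces the reported percentages. The function names `cohens_d` and `effect_size_shares` are hypothetical, and the pooled-standard-deviation form of Cohen's d is the textbook definition, not necessarily the exact variant the authors used.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (textbook form)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def effect_size_shares(d_idea, d_exec):
    """ASSUMPTION: share of the gap = each d over the sum of both d values.

    This is one reading that matches the reported 71%/29% split; the
    abstract itself does not specify the decomposition method.
    """
    total = d_idea + d_exec
    return d_idea / total, d_exec / total


# Effect sizes reported in the abstract.
idea_share, exec_share = effect_size_shares(2.23, 0.90)
print(f"idea share: {idea_share:.1%}, execution share: {exec_share:.1%}")

# Share of AI papers above the human median on both dimensions (7 of 912).
joint_rate = 7 / 912
print(f"AI papers beating the human median on both: {joint_rate:.1%}")
```

Under this assumption the shares come out at roughly 71% and 29%, and 7 of 912 rounds to 0.8%, in line with the abstract. `cohens_d` is included only to show how the reported effect sizes would be obtained from per-paper scores, which are not available here.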
