2603.11001v1 Mar 11, 2026 cs.CY

RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

Valerie Chen

Carnegie Mellon University

Citations: 1,084

h-index: 16

Kevin Wei

Citations: 487

h-index: 8

Patricia Paskov

Citations: 61

h-index: 4

Shengxin Hong

Citations: 10

h-index: 2

Dan Bateyko

Citations: 4

h-index: 1

Xavier Roberts-Gaal

Citations: 53

h-index: 3

Carson Ezell

Citations: 525

h-index: 8

Gailius Praninskas

Citations: 1

h-index: 1

Umang Bhatt

Citations: 3,354

h-index: 21

E. Guest

Citations: 82

h-index: 2

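Human uplift studies, which measure the effect of AI on human performance relative to a baseline and typically use randomized controlled trial (RCT) methodology, are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. Although the underlying methodology is well established, little research has examined how it interacts with the distinctive properties of frontier AI systems, especially when results inform high-stakes decisions. This study presents findings from interviews with 16 experts who have conducted human uplift studies in biosecurity, cybersecurity, education, and labor. Across the interviews, experts described a persistent tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and open-ended real-world settings threaten the assumptions behind internal, external, and construct validity, which complicates the interpretation and appropriate use of evidence from human uplift studies. The study synthesizes these challenges across the main stages of human uplift research, links them to the solutions experts reported, and clarifies the limits and appropriate uses of human uplift studies in high-stakes decision-making.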

Original Abstract

Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well-established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.
