2602.20571v1 Feb 24, 2026 cs.AI

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Ayush Sawarni

Stanford

Jiyuan Tan

Vasilis Syrgkanis

Original Abstract

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 138 real-world datasets, curated from 85 peer-reviewed research papers and four widely used causal-inference textbooks. For each query, a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 84% of cases, full identification-specification correctness drops to only 30%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.
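The two-part output described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: the class `IdentificationSpec`, its field names, the functions `score_identification` and `score_estimation`, and the exact-match and interval-coverage criteria are hypothetical and are not taken from the benchmark's actual schema or scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationSpec:
    """Hypothetical container for the structured identification output."""
    strategy: str                       # e.g. "instrumental_variables"
    treatment: str
    outcome: str
    controls: frozenset = frozenset()   # order-insensitive set of control variables
    design_elements: tuple = ()         # sorted (name, value) pairs, e.g. the instrument


def score_identification(pred: IdentificationSpec, gold: IdentificationSpec) -> dict:
    """Score the research design alone, ignoring any numbers.

    Assumed criteria: the strategy matches by name, and the full
    specification counts as correct only if every field matches.
    """
    return {
        "strategy_correct": pred.strategy == gold.strategy,
        "full_spec_correct": pred == gold,
    }


def score_estimation(estimate: float, std_error: float,
                     reference: float, z: float = 1.96) -> dict:
    """Score the numbers alone, ignoring the design.

    Assumed criteria: absolute error against a reference estimate, and
    whether a normal-approximation interval covers that reference.
    """
    lower, upper = estimate - z * std_error, estimate + z * std_error
    return {
        "abs_error": abs(estimate - reference),
        "interval_covers_reference": lower <= reference <= upper,
    }


# Hypothetical query: the right strategy but a missing control variable.
gold = IdentificationSpec(
    strategy="instrumental_variables",
    treatment="schooling",
    outcome="log_wage",
    controls=frozenset({"age", "region"}),
    design_elements=(("instrument", "quarter_of_birth"),),
)
pred = IdentificationSpec(
    strategy="instrumental_variables",
    treatment="schooling",
    outcome="log_wage",
    controls=frozenset({"age"}),
    design_elements=(("instrument", "quarter_of_birth"),),
)

ident_scores = score_identification(pred, gold)
est_scores = score_estimation(estimate=0.5, std_error=0.1, reference=0.6)
```

In this hypothetical case the prediction names the correct high-level strategy but fails the full-specification check, while its numerical estimate is scored independently. That is the kind of gap the abstract reports between strategy accuracy and full-specification accuracy.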
