2605.26029v1 May 25, 2026 cs.AI

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Hao Peng
Dylan Zhang
Yuen Chen
Junlin Yang
Xiangchen Song
Qirun Dai
Xiao Liu
Aniket Vashishtha
Jing Shi
Chenhao Tan

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This gap motivates our exploration of different interaction strategies. Mixed observation--intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents and show that asking the model to verify the consistency between its hypothesis and past data helps mitigate it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
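
To make the distinction between prediction and mechanism recovery concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a linear SCM with Gaussian noise and assumes that an edge-level $F_1$ is computed over directed (parent, child) pairs; the abstract does not specify how CausaLab samples SCMs or defines all-edge $F_1$. The names `edge_f1` and `sample_scm` are hypothetical.

```python
import random


def edge_f1(predicted, truth):
    """F1 over directed edges, each given as a (parent, child) tuple.

    Assumption: a plain set-overlap F1; the paper's exact metric may differ.
    """
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        return 1.0
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def sample_scm(weights, order, interventions=None, rng=None, noise_std=0.1):
    """Draw one sample from a toy linear SCM (illustrative assumption).

    weights: {(parent, child): coefficient}
    order: node names in topological order
    interventions: {node: value}, i.e. do(node = value); an intervened node
        ignores its parents and its noise term
    """
    rng = rng or random.Random()
    interventions = interventions or {}
    values = {}
    for node in order:
        if node in interventions:
            values[node] = interventions[node]
            continue
        parent_sum = sum(
            coeff * values[parent]
            for (parent, child), coeff in weights.items()
            if child == node
        )
        values[node] = parent_sum + rng.gauss(0.0, noise_std)
    return values


if __name__ == "__main__":
    # Ground truth is a chain A -> B -> C.
    true_weights = {("A", "B"): 1.5, ("B", "C"): -2.0}
    order = ["A", "B", "C"]

    # A hypothesis that gets one edge right and one edge wrong.
    hypothesis = [("A", "B"), ("A", "C")]
    print(edge_f1(hypothesis, true_weights.keys()))  # 0.5

    # Intervening on B cuts it off from A, so C responds to B alone.
    rng = random.Random(0)
    print(sample_scm(true_weights, order, interventions={"B": 2.0}, rng=rng))
```

In this toy chain, a hypothesis containing A -> C instead of B -> C can still predict C reasonably well from observational data, because A and C are correlated through B, yet it scores only 0.5 on edge $F_1$. An intervention on B is what distinguishes the two hypotheses.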
