2603.15542v1 Mar 16, 2026 cs.CY

InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Teqi Hao (Citations: 22, h-index: 1)
Zhengyu Shi (Citations: 80, h-index: 4)
Libo Wu (Citations: 17, h-index: 2)
Lin Zheng (Citations: 133, h-index: 2)
Annay Xie (Citations: 5, h-index: 1)
Zhichao Chen (Citations: 47, h-index: 5)
Guolei Liu (Citations: 6, h-index: 2)
Ming Dong (Citations: 2, h-index: 1)
Bohao Chen (Citations: 23, h-index: 2)
Yuan Qi (Citations: 666, h-index: 10)
Shaojie Shi (Citations: 104, h-index: 5)
Xinyu Su (Citations: 3, h-index: 1)
Rui Xu (Citations: 37, h-index: 3)
Zijian Chen (Citations: 113, h-index: 3)
Naifu Zhang (Citations: 26, h-index: 2)
Yinghui Xu (Citations: 684, h-index: 10)
Bohao Lv (Citations: 17, h-index: 2)
Zhuo Quan (Citations: 16, h-index: 3)

Original Abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.

Citations: 1, Influential: 0, Altmetric: 25, Score: 126.0
