2604.11502v1 Apr 13, 2026 cs.CL

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Pengfeng Li

Chao Hao

Hongyao Chen

Wenqiang Lei

Chen Huang

Xiaojie Wei

See-Kiong Ng

Original Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we introduce METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances the understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.
