2605.14754v1 May 14, 2026 cs.AI

XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

Yilei Zhao (Citations: 328, h-index: 6)
Che Wang (Citations: 14, h-index: 2)
Jiaming Zhang (Citations: 16, h-index: 3)
Wei Yang Bryan Lim (Citations: 71, h-index: 4)
Fuyao Zhang (Citations: 7, h-index: 2)
Tiantong Wu (Citations: 3, h-index: 1)
Yurong Hao (Citations: 6, h-index: 2)
Z. Gong (Citations: 0, h-index: 0)
Yikun Hou (Citations: 13, h-index: 2)
Foo Ping (Citations: 0, h-index: 0)
Fei Huang (Citations: 98, h-index: 3)
Chau Yuen (Citations: 7, h-index: 2)

Original Abstract

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.
