2601.12499v1 Jan 18, 2026 cs.AI

Failure Modes in Multi-Hop QA: The Weakest Link Law and the Recognition Bottleneck

Meiru Zhang

Zaiqiao Meng

Nigel Collier

Original Abstract

Despite scaling to massive context windows, Large Language Models (LLMs) struggle with multi-hop reasoning because of inherent position bias, which causes them to overlook information at certain positions. It is unclear whether these failures stem from an inability to locate evidence (recognition failure) or an inability to integrate it (synthesis failure). We introduce Multi-Focus Attention Instruction (MFAI), a semantic probe that disentangles these mechanisms by explicitly steering attention towards selected positions. Across five LLMs on two multi-hop QA tasks (MuSiQue and NeoQA), we establish the "Weakest Link Law": multi-hop reasoning performance collapses to the performance level of the least visible evidence. Crucially, this failure is governed by absolute position rather than the linear distance between facts (performance variance $<3\%$). We further identify a duality in attention steering: matched MFAI resolves recognition bottlenecks, improving accuracy by up to 11.5% in low-visibility positions, whereas misleading MFAI triggers confusion in real-world tasks but is successfully filtered in synthetic tasks. Finally, we demonstrate that "thinking" models that use System-2 reasoning effectively locate and integrate the required information, matching gold-only baselines even in noisy, long-context settings.
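One way to read the "Weakest Link Law" is as a minimum over per-position visibility: the multi-hop score is bounded by the hardest-to-see evidence position, regardless of how far apart the facts sit. The sketch below illustrates that reading only. The abstract gives no formula, so the `min` rule, the function name `weakest_link_estimate`, and the per-position accuracies are all assumptions made up for illustration, not values or methods from the paper.

```python
from typing import Mapping, Sequence


def weakest_link_estimate(
    visibility: Mapping[int, float],
    evidence_positions: Sequence[int],
) -> float:
    """Estimate multi-hop accuracy as the lowest single-position accuracy.

    This is an assumed simplification of the "Weakest Link Law":
    the least visible evidence position caps the overall score.
    """
    if not evidence_positions:
        raise ValueError("at least one evidence position is required")
    return min(visibility[p] for p in evidence_positions)


# Hypothetical single-fact accuracies by context position
# (illustrative numbers only, not results from the paper).
visibility = {0: 0.90, 1: 0.70, 2: 0.60, 3: 0.75, 4: 0.88}

# Two facts placed next to each other vs. two facts placed further apart.
# Both pairs include position 2, the least visible slot.
adjacent = weakest_link_estimate(visibility, [1, 2])
spread = weakest_link_estimate(visibility, [0, 2])

# A pair that avoids the weak position entirely.
edges = weakest_link_estimate(visibility, [0, 4])

print(adjacent, spread, edges)
```

Under this assumed rule, the adjacent and spread placements receive the same estimate because they share the same weakest position, which mirrors the abstract's claim that absolute position matters more than the distance between facts.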
