2601.04073v1 Jan 07, 2026 cs.CV

Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts

Bing Qin (Citations: 614, h-index: 11)
Zhihao Zhu (Citations: 8, h-index: 2)
Jiafeng Liang (Citations: 68, h-index: 5)
Shixin Jiang (Citations: 107, h-index: 5)
Jinlan Fu (Citations: 185, h-index: 7)
Ming Liu (Citations: 574, h-index: 11)
Guanglu Sun (Citations: 97, h-index: 6)
See-Kiong Ng (Citations: 85, h-index: 5)

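Large multimodal models (LMMs) have shown strong performance in video reasoning using Chain-of-Thought (CoT). However, the stability of their reasoning process remains in question. This paper identifies an important failure mode called "textual inertia": once a textual hallucination occurs during the thinking process, the model blindly follows the erroneous text and ignores contradicting visual evidence. To investigate this systematically, we propose the LogicGraph Perturbation Protocol, which injects structural perturbations into the reasoning chains of various LMMs in order to evaluate their self-reflection ability. In the experiments, models corrected the error on their own in fewer than 10% of cases, and in most cases they were vulnerable to blind propagation of textual errors. To mitigate this problem, we propose Active Visual-Context Refinement, a training-free inference method. It uses an active visual re-grounding mechanism to strengthen fine-grained verification, and an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments show that the proposed method substantially reduces hallucination propagation and improves reasoning robustness.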

Original Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
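The evaluation idea in the abstract (inject an erroneous step into a reasoning chain, then check whether the model recovers) can be illustrated with a minimal Python sketch. This is not the paper's LogicGraph Perturbation Protocol: the abstract gives no implementation details, so the sketch ignores any graph structure and simply replaces one step of a linear chain. The names `Trial`, `perturb`, `self_correction_rate`, and the stub model `follow_text` are all hypothetical, and "self-correction" is assumed here to mean that the final answer still matches the visually supported one.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trial:
    """One perturbed reasoning chain (all fields are hypothetical)."""
    steps: List[str]      # clean reasoning steps produced before perturbation
    corrupted_step: str   # erroneous text that contradicts the video
    inject_at: int        # index at which the corrupted step replaces the original
    gold_answer: str      # answer supported by the visual evidence


def perturb(steps: Sequence[str], corrupted_step: str, inject_at: int) -> List[str]:
    """Keep the prefix, swap in the corrupted step, and drop everything after it."""
    return list(steps[:inject_at]) + [corrupted_step]


def self_correction_rate(
    trials: Sequence[Trial],
    continue_reasoning: Callable[[List[str]], str],
) -> float:
    """Fraction of perturbed chains for which the model still reaches the gold answer."""
    if not trials:
        return 0.0
    corrected = 0
    for trial in trials:
        prefix = perturb(trial.steps, trial.corrupted_step, trial.inject_at)
        if continue_reasoning(prefix) == trial.gold_answer:
            corrected += 1
    return corrected / len(trials)


def follow_text(prefix: List[str]) -> str:
    """Stub model that exhibits textual inertia: it trusts the last written step."""
    return "red" if "red" in prefix[-1] else "blue"


demo_trials = [
    Trial(
        steps=["A car enters the frame.", "The car is blue.", "So the answer is blue."],
        corrupted_step="The car is red.",
        inject_at=1,
        gold_answer="blue",
    ),
    Trial(
        steps=["A ball rolls left.", "The ball is blue."],
        corrupted_step="The ball is red.",
        inject_at=1,
        gold_answer="blue",
    ),
]

print(self_correction_rate(demo_trials, follow_text))
```

In a real setup, `continue_reasoning` would wrap a call to an LMM that receives both the video and the perturbed chain; the stub here only makes the metric concrete.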

3 Citations

0 Influential

5.5 Altmetric

30.5 Score
