2604.01989v1 Apr 02, 2026 cs.CV

Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Jiwen Lu (Citations: 264, h-index: 4)
Bo Gong (Citations: 0, h-index: 0)
Yujin Zheng (Citations: 0, h-index: 0)
Fanye Kong (Citations: 3, h-index: 1)
Jie Zhou (Citations: 22, h-index: 2)

Original Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
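The abstract describes IVE only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that the "historical attention trend" is an exponential moving average over past decoding steps, that "emerging" tokens are those whose current attention rises above that trend, and that "inertial" tokens are those whose trend stays above the uniform level without rising. The function names, the parameters `decay`, `alpha`, `beta`, `top_k`, and `min_rise`, and the multiplicative excitation and penalty are all hypothetical choices made for illustration.

```python
import numpy as np


def ema_trend(attn_history, decay=0.9):
    """Exponential moving average over past steps (assumed trend model).

    attn_history: array of shape (steps, visual_tokens).
    """
    trend = np.asarray(attn_history[0], dtype=float).copy()
    for row in attn_history[1:]:
        trend = decay * trend + (1.0 - decay) * np.asarray(row, dtype=float)
    return trend


def excite_visual_attention(attn_history, attn_now, alpha=0.5, beta=0.5,
                            top_k=2, min_rise=0.01):
    """Boost emerging visual tokens and damp inertial ones (illustrative only)."""
    attn_now = np.asarray(attn_now, dtype=float)
    trend = ema_trend(attn_history)
    rise = attn_now - trend

    # Emerging: up to top_k tokens whose attention rises clearly above the trend.
    candidates = np.argsort(-rise)[:top_k]
    emerging = np.zeros(attn_now.shape, dtype=bool)
    emerging[candidates] = rise[candidates] > min_rise

    # Inertial: tokens that have persistently held more than a uniform share
    # of attention and are not currently emerging.
    uniform_share = 1.0 / attn_now.size
    inertial = (trend > uniform_share) & ~emerging

    adjusted = attn_now.copy()
    adjusted[emerging] *= 1.0 + alpha   # excitation of emerging tokens
    adjusted[inertial] *= 1.0 - beta    # penalty on persistent concentration
    adjusted /= adjusted.sum()          # renormalize to a distribution
    return adjusted, emerging, inertial


if __name__ == "__main__":
    # Token 0 has dominated for three steps; token 1 starts to gain attention.
    history = np.array([[0.7, 0.1, 0.1, 0.1]] * 3)
    now = np.array([0.6, 0.25, 0.1, 0.05])
    adjusted, emerging, inertial = excite_visual_attention(history, now)
    print("emerging:", emerging, "inertial:", inertial)
    print("adjusted:", np.round(adjusted, 3))
```

In this toy case, attention mass shifts from the long-dominant token toward the newly rising one, which is the qualitative behavior the abstract attributes to breaking visual inertia. How the actual method scores, selects, and penalizes tokens is specified in the paper, not here.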
