2604.03556v1 Apr 04, 2026 cs.CV

Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Sangwoong Yoon

Sohyeong Kim

Kyeongbo Kong

Original Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
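
The abstract does not spell out how the DPP-based filtering is implemented. As a rough illustration only, the sketch below shows one common way to select a diverse subset of tokens with a DPP: greedy approximate MAP inference on an L-ensemble kernel built from per-token quality scores and cosine similarity. Everything here is an assumption made for illustration, not the authors' code: the function names `greedy_dpp_select` and `suppress_tokens`, the use of attention scores as the quality term, the choice of `k`, and suppression by zeroing token features.

```python
import numpy as np


def greedy_dpp_select(features, quality, k):
    """Greedily pick up to k indices that approximately maximise det(L_S).

    features: (n, d) array of token features (rows must be non-zero).
    quality:  (n,) array of positive per-token scores (assumed here to be
              attention received by each token; this is an assumption).
    """
    # Cosine similarity between tokens.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T
    # L-ensemble kernel: L_ij = q_i * sim_ij * q_j.
    kernel = quality[:, None] * sim * quality[None, :]

    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means the candidate is redundant
            # with tokens that were already selected.
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected


def suppress_tokens(tokens, keep_idx):
    """Zero out every token row that is not in keep_idx (hypothetical
    suppression rule; the paper may attenuate tokens differently)."""
    mask = np.zeros(len(tokens), dtype=bool)
    mask[keep_idx] = True
    return tokens * mask[:, None]


# Toy example: tokens 0 and 1 are identical, token 2 is orthogonal to both.
tokens = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
attention = np.array([1.0, 0.9, 0.5])
keep = greedy_dpp_select(tokens, attention, k=2)
filtered = suppress_tokens(tokens, keep)
```

In this toy case the greedy step first takes the highest-quality token, then prefers the orthogonal token over the duplicate even though the duplicate has a higher score, which is the diversity-preserving behaviour the abstract attributes to the DPP. The naive loop recomputes a determinant per candidate and is only meant for small token counts.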
