2604.24396v1 Apr 27, 2026 cs.CV

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Cao Liu (Citations: 7; h-index: 2)
Yubo Jiang (Citations: 1; h-index: 1)
Xin Yang (Citations: 1; h-index: 1)
Abudukelimu Wuerkaixi (Citations: 147; h-index: 8)
Feng-ying Xie (Citations: 3,505; h-index: 31)
Zhiguo Jiang (Citations: 194; h-index: 8)
Zheming Yuan (Citations: 15; h-index: 3)
Ke Zeng (Citations: 11; h-index: 3)
Haopeng Zhang (Citations: 95; h-index: 6)
Xuxin Cheng (Citations: 13; h-index: 2)

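Vision-language models (VLMs) frequently suffer degraded performance from object hallucination, that is, generating content that contradicts visual reality, which arises from over-reliance on linguistic priors. This work proposes Positive-and-Negative Decoding (PND), an inference framework that intervenes directly in the decoding process to strengthen visual fidelity without any training. PND is motivated by the finding of a critical attention deficit in VLMs, in which visual features are empirically under-weighted. PND corrects this through a dual-path contrast. The positive path uses multi-layer attention to amplify salient visual evidence, encouraging faithful descriptions and directly addressing the attention deficit. At the same time, the negative path identifies and degrades the core object's features to create a strong counterfactual, suppressing ungrounded generation driven by linguistic priors. By comparing the model's outputs from these two perspectives at each step, PND steers generation toward text that is not merely linguistically plausible but also visually factual. Extensive experiments on benchmarks such as POPE, MME, and CHAIR show that PND achieves up to a 6.5% accuracy improvement, substantially reducing object hallucination while enhancing descriptive detail. It also works effectively across diverse VLM architectures, including LLaVA, InstructBLIP, InternVL, and Qwen-VL.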

Original Abstract

Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
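
The abstract describes contrasting the model's outputs from the positive and negative paths at each decoding step, but it does not state the exact combination rule. The sketch below illustrates one plausible form, borrowed from common contrastive-decoding practice rather than taken from the paper: the contrasted logits are (1 + alpha) * positive - alpha * negative. The function name `contrast_step`, the parameter `alpha`, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_step(pos_logits, neg_logits, alpha=1.0):
    """Combine next-token logits from the two paths (assumed rule, not the paper's).

    pos_logits: logits when salient visual evidence is amplified (positive path).
    neg_logits: logits when the core object's features are degraded (negative path).
    alpha: hypothetical contrast strength; alpha=0 reduces to softmax(pos_logits).
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("both paths must score the same vocabulary")
    contrasted = [
        (1.0 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)
    ]
    return softmax(contrasted)


# Toy example: the image shows a dog, but the language prior slightly favors "cat".
vocab = ["cat", "dog", "tree"]
pos_logits = [2.6, 2.5, 0.1]  # positive path: "cat" still narrowly ahead
neg_logits = [2.8, 0.5, 0.1]  # negative path: "dog" loses support once degraded

probs = contrast_step(pos_logits, neg_logits, alpha=1.0)
best = vocab[probs.index(max(probs))]  # best == "dog"
```

In this toy case, a token that stays likely even after the object is degraded ("cat") is treated as prior-driven and penalized, while a token whose score depends on the visual evidence ("dog") is promoted. With `alpha=0.0` the contrast is disabled and the positive-path choice ("cat") is returned unchanged.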

Citations: 0 | Influential: 0 | Altmetric: 15.5 | Score: 77.5
