2601.22451v1 Jan 30, 2026 cs.CV

Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Shiyu Liu (citations: 9, h-index: 2)

Xinyi Wen (citations: 15, h-index: 2)

Zhibin Lan (citations: 146, h-index: 7)

Ante Wang (citations: 413, h-index: 10)

Jinsong Su (citations: 265, h-index: 7)

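Despite advances in Large Vision-Language Models (LVLMs), object hallucination remains a serious problem in image captioning: models describe objects that are not present, which undermines their reliability. Prior work attributes this to LVLMs' over-reliance on language priors and tries to mitigate it with techniques such as logit calibration, but it lacks an in-depth analysis of that over-reliance. To understand it better, this study runs a series of preliminary experiments and finds that, as generation length grows, over-reliance on language priors raises the probability of hallucinated object tokens and so worsens object hallucination. To address this, the study proposes a verification method that does not depend on language priors, so that an LVLM can judge accurately whether an object is present. Building on it, the study proposes a novel training-free self-validation framework to counter the over-reliance trap. The framework verifies the existence of objects in sampled candidate captions and then further mitigates object hallucination through caption selection or aggregation. Experiments show that the framework substantially reduces object hallucination in image captioning (e.g., a 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B). These results point to a new way of mitigating hallucination by drawing on the latent capability of LVLMs themselves.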
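
The instance-level CHAIR metric cited in the abstract is commonly defined as the fraction of object mentions in generated captions that do not correspond to an object in the image, so lower values are better. The sketch below illustrates that ratio only. It assumes objects have already been extracted from each caption and mapped to ground-truth category names; the function name `chair_i` and the toy data are hypothetical, and the synonym mapping used by the standard benchmark is omitted.

```python
def chair_i(mentioned_objects, ground_truth_objects):
    """Fraction of mentioned object instances that are absent from the image.

    mentioned_objects: list of lists, objects named in each caption.
    ground_truth_objects: list of sets, objects actually in each image.
    """
    total = 0
    hallucinated = 0
    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        for obj in mentioned:
            total += 1
            if obj not in truth:
                hallucinated += 1
    # Avoid division by zero when no caption mentions any object.
    return hallucinated / total if total else 0.0


# Toy data (made up for illustration): "bench" is not in the first image.
captions_objects = [["dog", "frisbee", "bench"], ["cat", "sofa"]]
truth_objects = [{"dog", "frisbee"}, {"cat", "sofa"}]
score = chair_i(captions_objects, truth_objects)
```

Here one of the five mentions is hallucinated, so `score` is 0.2.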

Original Abstract

Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in the image captioning task, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs' over-reliance on language priors and attempts to mitigate it through logit calibration. However, these approaches still lack a thorough analysis of the over-reliance. To gain a deeper understanding of it, we conduct a series of preliminary experiments, which indicate that as the generation length increases, LVLMs' over-reliance on language priors inflates the probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects' existence in sampled candidate captions and then further mitigates object hallucination via caption selection or aggregation. Experimental results demonstrate that our framework mitigates object hallucination significantly in the image captioning task (e.g., a 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
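
The caption-selection step described above can be pictured with a minimal sketch. It is not the authors' implementation: the abstract does not specify how Language-Prior-Free Verification scores an object or how candidates are ranked, so the `verify` callable, the mean-confidence ranking rule, and all sample values below are assumptions made for illustration.

```python
from statistics import mean


def select_caption(candidates, verify):
    """Return the candidate caption whose objects are best supported.

    candidates: list of (caption, objects) pairs, where objects is the
        list of object names mentioned in that caption.
    verify: hypothetical callable mapping an object name to a confidence
        in [0, 1] that the object exists in the image.
    """
    best_caption = None
    best_score = -1.0
    for caption, objects in candidates:
        confidences = [verify(obj) for obj in objects]
        # Assumption: rank by mean confidence; a caption that names no
        # objects gets the lowest possible score.
        score = mean(confidences) if confidences else 0.0
        if score > best_score:
            best_caption, best_score = caption, score
    return best_caption


# Stub confidences standing in for a real verifier (made-up values).
stub_confidence = {"dog": 0.95, "frisbee": 0.9, "bench": 0.1, "grass": 0.8}
candidates = [
    ("A dog catches a frisbee near a bench.", ["dog", "frisbee", "bench"]),
    ("A dog catches a frisbee on the grass.", ["dog", "frisbee", "grass"]),
]
best = select_caption(candidates, lambda obj: stub_confidence.get(obj, 0.0))
```

With these stub scores the second caption is chosen, because the low confidence for "bench" pulls down the first caption's mean. An aggregation variant would instead merge the well-supported objects from several candidates into one caption, which this sketch does not attempt.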
