2605.01733v1 May 03, 2026 cs.CV

GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

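Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats a model's self-generated captions as a uniformly positive resource, but we find that naively including a caption can hurt performance instead: Qwen2.5-VL-3B loses nearly 10 percentage points of accuracy on HallusionBench. Two structural properties explain this. First, a caption influences not only the model's final answer but also its reasoning trajectory and word choices. Second, caption errors are asymmetric: omissions are far more frequent than fabrications, yet each fabrication has a larger individual impact. A caption's usefulness therefore depends on the individual query, not on the dataset as a whole. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides, for each query, how much of the caption the model uses. GEASS gates the caption by the confidence of the clean (caption-free) path, weights it by the entropy reduction the caption provides, and raises the evidence bar when the two paths disagree. In experiments on POPE and HallusionBench with four VLMs, GEASS consistently improves over vanilla inference and contrastive decoding while requiring only two extra forward passes per query.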

Original Abstract

Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help--dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption's usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
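The abstract names three decisions (gate by clean-path confidence, weight by entropy reduction, raise the bar on disagreement) but gives no formulas. The following is only a minimal sketch of how such a per-query rule could look, not the authors' implementation. The function names, the thresholds `conf_gate`, `base_bar`, and `disagree_bar`, the normalization of the entropy gain, and the linear mixing of logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def caption_weight(clean_logits, caption_logits,
                   conf_gate=0.9, base_bar=0.1, disagree_bar=0.5):
    """Return a weight in [0, 1] for the caption-conditioned path.

    All three thresholds are hypothetical values, not taken from the paper.
    """
    p_clean = softmax(clean_logits)
    p_cap = softmax(caption_logits)

    # 1) Gate: a confident clean path does not consult the caption at all.
    if max(p_clean) >= conf_gate:
        return 0.0

    # 2) Weight: relative entropy reduction produced by the caption.
    h_clean = entropy(p_clean)
    if h_clean <= 0.0:
        return 0.0
    gain = (h_clean - entropy(p_cap)) / h_clean

    # 3) Disagreement: require more evidence when the top answers differ.
    agree = argmax(p_clean) == argmax(p_cap)
    bar = base_bar if agree else disagree_bar
    if gain < bar:
        return 0.0
    return min(gain, 1.0)


def steer(clean_logits, caption_logits, **thresholds):
    """Mix the two paths' logits using the per-query caption weight."""
    alpha = caption_weight(clean_logits, caption_logits, **thresholds)
    return [(1.0 - alpha) * c + alpha * k
            for c, k in zip(clean_logits, caption_logits)]
```

In this sketch, an uncertain clean path such as `[0.2, 0.0]` accepts a modestly sharper caption path that agrees with it (`[1.0, 0.0]`) but rejects an equally sharp one that flips the answer (`[0.0, 1.0]`), which mirrors the asymmetry between helpful and fabricated captions described above.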
