2603.07048v1 Mar 07, 2026 cs.CV

Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

Yaoxin Mao (Citations: 4, h-index: 2)
Hao Fang (Citations: 229, h-index: 9)
Jiawei Kong (Citations: 128, h-index: 5)
Shutao Xia (Citations: 437, h-index: 12)
Xiaochen Yang (Citations: 193, h-index: 2)
Bin Chen (Citations: 166, h-index: 7)

Original Abstract

Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.
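
The abstract names two ingredients without giving formulas: an attention pattern in which image tokens can either see or not see the tokens of other images, and a preference objective that contrasts the two settings. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes a causal baseline mask, a hypothetical `segments` encoding (0 for text tokens, a positive id per image), bidirectional visibility between different images when interaction is enabled, and a standard DPO-style loss in which the response produced under full interaction plays the preferred role. The function names `build_mask` and `dpo_loss` and the value of `beta` are invented for this example.

```python
import math


def build_mask(segments, cross_image):
    """Return allowed[q][k]: may query token q attend to key token k?

    segments[i] is 0 for a text token and a positive image id otherwise.
    Assumed rules (not taken from the paper): causal attention by default;
    pairs of tokens from two different images are visible in both
    directions when cross_image is True and hidden when it is False.
    """
    n = len(segments)
    allowed = [[k <= q for k in range(n)] for q in range(n)]  # causal baseline
    for q in range(n):
        for k in range(n):
            both_images = segments[q] > 0 and segments[k] > 0
            if both_images and segments[q] != segments[k]:
                allowed[q][k] = cross_image
    return allowed


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs.

    Assumption: "chosen" is the answer obtained with full inter-image
    interaction and "rejected" the one obtained with images mutually invisible.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Toy sequence: text, two tokens of image 1, two tokens of image 2, text.
segments = [0, 1, 1, 2, 2, 0]
full = build_mask(segments, cross_image=True)
isolated = build_mask(segments, cross_image=False)
```

In this toy setup, `full[1][3]` is true (an image-1 token can look ahead to image 2), `isolated[3][1]` is false (image 2 cannot look back at image 1), and the final text token attends to every image token in both masks. Whether CAPL uses exactly this visibility pattern or this loss is not stated in the abstract.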
