2601.22574v1 Jan 30, 2026 cs.CV

Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding

Tong Zhang (Citations: 74, h-index: 3)
Yuan Gao (Citations: 56, h-index: 5)
Jinman Zhao (Citations: 27, h-index: 3)
Xin Xu (Citations: 51, h-index: 3)
Han Bao (Citations: 19, h-index: 2)
Zonghui Wang (Citations: 99, h-index: 7)
Wenzhi Chen (Citations: 26, h-index: 2)
Wenbin Xing (Citations: 16, h-index: 2)

Abstract

Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
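The abstract does not give the decoding formula, so the following is only a minimal sketch of the general contrastive-decoding idea it describes, not the authors' implementation. It assumes the widely used scoring rule (1 + alpha) * log p(original) - alpha * log p(negative). The helper names (`shuffle_frames`, `contrastive_scores`, `greedy_token`), the weight `alpha`, and the toy logits are all hypothetical. Frame shuffling stands in for just one possible way to break temporal consistency; the paper's negative features also disrupt semantic associations, which this sketch does not model.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def shuffle_frames(frame_features, seed=0):
    """Hypothetical negative-feature builder: permute the frame order.

    This only breaks temporal consistency; it is an illustrative stand-in,
    not the construction used in the paper.
    """
    rng = random.Random(seed)
    negative = list(frame_features)
    rng.shuffle(negative)
    return negative


def contrastive_scores(logits_original, logits_negative, alpha=1.0):
    """Assumed rule: (1 + alpha) * log p_original - alpha * log p_negative.

    Tokens that stay likely even when the video evidence is destroyed
    (prior-driven guesses) are penalized; tokens that depend on the intact
    video are boosted.
    """
    lp_orig = log_softmax(logits_original)
    lp_neg = log_softmax(logits_negative)
    return [(1 + alpha) * o - alpha * n for o, n in zip(lp_orig, lp_neg)]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
# Token 0 is favored regardless of the video; token 1 is only supported
# when the model sees the intact video features.
logits_with_video = [2.0, 1.9, 0.1]
logits_with_negative = [2.0, 0.5, 0.1]

plain_choice = greedy_token(logits_with_video)
contrastive_choice = greedy_token(
    contrastive_scores(logits_with_video, logits_with_negative, alpha=1.0)
)
```

In this toy case plain greedy decoding picks token 0, while the contrastive score picks token 1, because token 0 remains just as likely under the disrupted features. In a real Video LLM the two logit vectors would come from two forward passes at each decoding step, one with the original video features and one with the negative features.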

Citations: 1
Influential: 0
Altmetric: 3.5
Score: 18.5
