2604.10506v1 Apr 12, 2026 cs.AI

A Progressive Training Strategy for Mitigating Spatio-Temporal Hallucinations in Vision-Language Models

A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

Tao Jin (Citations: 2, h-index: 1)
Zhou Zhao (Citations: 163, h-index: 8)
Sashuai Zhou (Citations: 49, h-index: 4)
Xiaoda Yang (Citations: 447, h-index: 9)
Lixin Yang (Citations: 1,395, h-index: 16)
Shuai Yang (Citations: 327, h-index: 8)
Xiangyu Yue (Citations: 3, h-index: 1)
Can Wang (Citations: 0, h-index: 0)
Jingyang Xue (Citations: 4, h-index: 1)
Menglan Tang (Citations: 0, h-index: 0)
Checheng Yu (Citations: 52, h-index: 4)
Xunzhe Zhou (Citations: 23, h-index: 1)

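Vision-language models (VLMs) have advanced considerably in static image understanding, but they still struggle with spatio-temporal reasoning. One major problem is "multi-image reasoning hallucination," which shows up as a performance gap between forward and reverse temporal queries and indicates reliance on superficial cues rather than genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes complex reasoning into detailed spatio-temporal steps and definitive judgments. Building on this, we present a progressive training framework: it first learns logical structure through supervised training on the CoT dataset, then fine-tunes on scalable weakly-labeled data to generalize broadly. Our experiments show that this approach not only improves the accuracy of the backbone model but also reduces the forward-reverse performance gap from over 70% to 6.53%. This confirms that the method is effective at developing genuine dynamic reasoning and at reducing the temporal bias inherent in current VLMs.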

Original Abstract

Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70% to only 6.53%. This confirms the method's ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.
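
The forward-backward performance gap quoted in the abstract can be illustrated with a small sketch. The abstract does not define the metric precisely, so this assumes the gap is the absolute difference, in percentage points, between accuracy on forward temporal queries and accuracy on reverse temporal queries. The names `QueryResult`, `accuracy`, and `forward_reverse_gap`, and the direction labels `"forward"` and `"reverse"`, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QueryResult:
    """Outcome of one temporal query (hypothetical record layout)."""
    direction: str  # assumed labels: "forward" or "reverse"
    correct: bool   # whether the model's answer matched the reference


def accuracy(results: List[QueryResult], direction: str) -> float:
    """Accuracy in percent over the queries with the given direction."""
    subset = [r for r in results if r.direction == direction]
    if not subset:
        raise ValueError(f"no results for direction {direction!r}")
    return 100.0 * sum(r.correct for r in subset) / len(subset)


def forward_reverse_gap(results: Iterable[QueryResult]) -> float:
    """Assumed metric: |forward accuracy - reverse accuracy| in points."""
    results = list(results)
    return abs(accuracy(results, "forward") - accuracy(results, "reverse"))


# Toy data, invented for illustration only: 9 of 10 forward queries
# correct, 2 of 10 reverse queries correct.
toy_results = (
    [QueryResult("forward", True)] * 9
    + [QueryResult("forward", False)] * 1
    + [QueryResult("reverse", True)] * 2
    + [QueryResult("reverse", False)] * 8
)

gap = forward_reverse_gap(toy_results)
print(f"forward-reverse gap: {gap:.2f} points")
```

A model that answers from genuine temporal understanding should score similarly in both directions, so a gap near zero is the desired outcome, while a large gap suggests the answers depend on the order in which the question is posed.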

