2603.02556v1 Mar 03, 2026 cs.CV

Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Authors:

Jiashen Hua (Citations: 103, h-index: 3)
Junyi Feng (Citations: 10, h-index: 2)
Zhiyu Pan (Citations: 20, h-index: 2)
Yizheng Wu (Citations: 104, h-index: 6)
Bing Deng (Citations: 47, h-index: 5)
Zhiguo Cao (Citations: 71, h-index: 5)
Shaotian Yan (Citations: 166, h-index: 5)
Jieping Ye (Citations: 58, h-index: 6)

Abstract

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.
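The abstract mentions curating contrastive pairs "according to multi-modal similarity" without giving details. The sketch below is one possible reading, not the authors' implementation: it assumes each VQA sample already carries an image embedding and a question embedding, and it keeps a pair when both cosine similarities clear a threshold. The function names, the dictionary keys (`id`, `image_emb`, `question_emb`), and the threshold values are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate_contrastive_pairs(samples, img_thresh=0.8, txt_thresh=0.8):
    """Return (id_a, id_b, img_sim, txt_sim) for pairs similar in BOTH modalities.

    Assumption: `samples` is a list of dicts with hypothetical keys
    'id', 'image_emb', and 'question_emb'. The thresholds are illustrative
    and are not taken from the paper.
    """
    pairs = []
    for a, b in combinations(samples, 2):
        img_sim = cosine(a["image_emb"], b["image_emb"])
        txt_sim = cosine(a["question_emb"], b["question_emb"])
        # Keep the pair only if the images look alike and the questions are close.
        if img_sim >= img_thresh and txt_sim >= txt_thresh:
            pairs.append((a["id"], b["id"], img_sim, txt_sim))
    # Most similar pairs first, ranked by the sum of the two similarities.
    pairs.sort(key=lambda p: p[2] + p[3], reverse=True)
    return pairs
```

The exhaustive pairwise loop is quadratic in the number of samples, so at dataset scale an approximate nearest-neighbor index would likely be needed; the abstract does not say which approach the authors took.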
