2602.11073v2 Feb 11, 2026 cs.CV

Chatting with Images for Introspective Visual Thinking

Q. Liu

Citations: 7,875

h-index: 30

Liang Wang

Institute of Automation, Chinese Academy of Sciences

Citations: 614

h-index: 16

Jun Wu

Citations: 726

h-index: 16

Jian Guan

Citations: 120

h-index: 5

Shuning Wu

Citations: 83

h-index: 1

Wei Wu

Citations: 82

h-index: 1

Tienie Tan

Citations: 0

h-index: 0

Abstract

Current large vision-language models (LVLMs) typically rely on text-only reasoning over a single-pass visual encoding, which often loses fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code. However, the resulting visual states are often insufficiently grounded in linguistic semantics, which impairs effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum that combines supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
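
The abstract does not say how "language-guided feature modulation" is computed, so the following is only an illustrative sketch, not ViLaVT's method. It assumes a FiLM-style scale-and-shift, a well-known conditioning technique swapped in here for illustration: a prompt embedding is projected to a per-channel scale and shift, and the same pair is applied to the features of several image regions in one pass. The function names, the weight matrices `w_gamma` and `w_beta`, and the toy numbers are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(weights: Matrix, vec: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def film_modulate(
    region_features: List[Vector],
    prompt_embedding: Vector,
    w_gamma: Matrix,
    w_beta: Matrix,
) -> List[Vector]:
    """Apply one prompt-derived scale and shift to every region's features.

    Hypothetical stand-in for language-guided modulation: the prompt
    decides how each feature channel is rescaled (gamma) and offset (beta),
    and all regions are updated together with the same pair.
    """
    gamma = matvec(w_gamma, prompt_embedding)  # per-channel scale
    beta = matvec(w_beta, prompt_embedding)    # per-channel shift
    return [
        [g * x + b for g, x, b in zip(gamma, feat, beta)]
        for feat in region_features
    ]


# Toy example: two regions with two feature channels each.
regions = [[1.0, 2.0], [3.0, 4.0]]
prompt = [1.0, 0.0]
w_gamma = [[2.0, 0.0], [0.5, 0.0]]  # gamma = [2.0, 0.5] for this prompt
w_beta = [[0.0, 1.0], [1.0, 0.0]]   # beta  = [0.0, 1.0] for this prompt

modulated = film_modulate(regions, prompt, w_gamma, w_beta)
print(modulated)  # [[2.0, 2.0], [6.0, 3.0]]
```

In this toy setup a different prompt embedding yields a different scale and shift, so the same regions are re-weighted differently depending on the text, which is the general idea the abstract attributes to coupling linguistic reasoning with visual state updates.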

0 Citations

0 Influential

15 Altmetric

75.0 Score
