2604.24339v1 Apr 27, 2026 cs.CV

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

Yumeng Zhang (Citations: 91, h-index: 3)

Naiming Liu (Citations: 279, h-index: 9)

Zhi-zong Wu (Citations: 97, h-index: 4)

Tong Wang (Citations: 65, h-index: 1)

Shuning Wang (Citations: 0, h-index: 0)

Original Abstract

Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including a lack of low-level visual information and of effective visual feedback. To address these problems, this paper proposes ForeSight, a unified multimodal interleaved reasoning framework that enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools that integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, it designs a mask-based visual feedback mechanism that incorporates visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the proposed framework, the authors construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results show that the ForeSight-7B model significantly outperforms other models of the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
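
The abstract names two ingredients without giving formulas: a reward based on final answer accuracy and a mask-based visual feedback step. The Python sketch below illustrates what such pieces could look like in their simplest form. It is not the authors' implementation: the function names `accuracy_reward` and `apply_mask`, the exact-match normalization, and the binary mask over a nested-list image are all assumptions made for illustration.

```python
from typing import List, Sequence


def accuracy_reward(predicted: str, reference: str) -> float:
    """Hypothetical outcome-only reward: 1.0 on a normalized exact match, else 0.0.

    The abstract only says that final answer accuracy is the reward signal;
    the case-insensitive, whitespace-collapsing comparison here is an assumption.
    """
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())

    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def apply_mask(
    image: Sequence[Sequence[int]],
    mask: Sequence[Sequence[int]],
    fill: int = 0,
) -> List[List[int]]:
    """Hypothetical masking step: keep pixels where mask is truthy, else use `fill`.

    A masked view like this could be shown back to a model so it can re-check
    an answer against a specific region; how ForeSight builds or uses its masks
    is not described in the abstract.
    """
    if len(image) != len(mask) or any(
        len(row) != len(mask_row) for row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    # Toy 2x2 grayscale image and a mask that keeps the diagonal.
    masked = apply_mask([[1, 2], [3, 4]], [[1, 0], [0, 1]])
    print(masked)
    print(accuracy_reward(" Red  Circle", "red circle"))
```

An outcome-only reward of this kind scores only the final answer, so any decision about when to call a tool or re-check an answer has to be learned indirectly through its effect on correctness, which matches the abstract's description of tool invocation and verification being decided autonomously.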

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
