2604.09349v1 Apr 10, 2026 cs.CV

Visually-Guided Policy Optimization for Multimodal Reasoning

Yong Wang

Citations: 293

h-index: 9

Xiangxiang Chu

Citations: 164

h-index: 7

Feng Xiong

Citations: 94

h-index: 5

Xuecai Hu

Citations: 9

h-index: 2

Liang Lin

Citations: 143

h-index: 7

Yanling Wang

Citations: 355

h-index: 3

Man Zhang

Citations: 15

h-index: 3

Zengbin Wang

Citations: 100

h-index: 4

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
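
The dual-grained advantage re-weighting described in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration and not the authors' implementation: the function name `reweight_advantages`, the mixing coefficients `alpha` and `beta`, the use of a mean as the "visual accumulation" of a trajectory, and the linear form of the weights. The per-token visual activation scores are taken as given inputs (for example, attention mass on visual tokens after any compensation step), and the trajectory-level advantages are assumed to come from a group-based RLVR method.

```python
from statistics import mean


def reweight_advantages(advantages, visual_activation, alpha=0.5, beta=0.5):
    """Hypothetical dual-grained re-weighting of trajectory advantages.

    advantages: one scalar advantage per trajectory in the sampled group.
    visual_activation: for each trajectory, a non-empty list of non-negative
        per-token visual activation scores (assumed to be precomputed).
    alpha, beta: assumed strengths in [0, 1] for the intra- and
        inter-trajectory weights; with this range every weight stays
        non-negative, so the sign of each advantage is preserved.

    Returns a list of per-token advantage lists, one per trajectory.
    """
    # Inter-trajectory signal: average activation of each trajectory,
    # compared with the average over the whole group.
    accumulated = [mean(scores) for scores in visual_activation]
    group_mean = mean(accumulated)

    reweighted = []
    for adv, scores, acc in zip(advantages, visual_activation, accumulated):
        # Trajectories with above-average visual accumulation get weight > 1.
        if group_mean > 0:
            inter_w = 1.0 + beta * (acc / group_mean - 1.0)
        else:
            inter_w = 1.0

        token_advantages = []
        for score in scores:
            # Tokens with above-average activation within their own
            # trajectory get weight > 1.
            if acc > 0:
                intra_w = 1.0 + alpha * (score / acc - 1.0)
            else:
                intra_w = 1.0
            token_advantages.append(adv * inter_w * intra_w)
        reweighted.append(token_advantages)
    return reweighted
```

With `alpha = beta = 0`, the sketch reduces to broadcasting each trajectory's advantage unchanged to all of its tokens, which is the baseline the re-weighting is meant to modify. Whether negative advantages should be scaled the same way as positive ones is a design choice that the abstract does not settle.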
