2605.28160v1 May 27, 2026 cs.AI

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Jiayi Ji
Xiaoshuai Sun
Rongrong Ji
Rui Zhao
Yidong Chen
Wujin Sun
Qianzhi Chen
Yang Zhang

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.
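
The control flow described in the abstract, in which a language model decides when to call a separate perception module, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `Trace`, `run_schedule`, `toy_controller`, and `toy_perceive`, as well as the `max_looks` budget, are hypothetical, and both the controller and the perception module are stand-in stubs rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One controller decision: request visual evidence or give an answer."""
    look: Optional[str] = None    # hypothetical: query for the perception module
    answer: Optional[str] = None  # hypothetical: final answer when reasoning ends


@dataclass
class Trace:
    """Record of one run: evidence gathered, number of looks, final answer."""
    evidence: List[str] = field(default_factory=list)
    looks: int = 0
    answer: Optional[str] = None


def run_schedule(
    question: str,
    controller: Callable[[str, List[str]], Step],
    perceive: Callable[[str], str],
    max_looks: int = 3,
) -> Trace:
    """Let the controller decide, step by step, whether to look or to answer.

    The look budget (max_looks) is an assumption added for this sketch so
    that the loop always terminates.
    """
    trace = Trace()
    for _ in range(max_looks + 1):
        step = controller(question, trace.evidence)
        if step.answer is not None:
            trace.answer = step.answer
            return trace
        if trace.looks == max_looks:
            break  # budget exhausted and the controller still wants to look
        trace.evidence.append(perceive(step.look or ""))
        trace.looks += 1
    return trace  # answer stays None if the controller never answered


def toy_controller(question: str, evidence: List[str]) -> Step:
    """Stub for the language model: look once, then answer from the evidence."""
    if not evidence:
        return Step(look="color of the left object")
    return Step(answer=f"Based on: {evidence[0]}")


def toy_perceive(query: str) -> str:
    """Stub for the perception module: returns a canned observation."""
    return f"[{query}] -> red"


if __name__ == "__main__":
    result = run_schedule(
        "What color is the left object?", toy_controller, toy_perceive
    )
    print(result.looks, result.answer)
```

In this sketch the image never enters the controller directly. Evidence reaches it only as the text returned by `perceive`, and only when the controller asks for it, which is one plausible reading of on-demand evidence acquisition.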

