2605.28160v1 May 27, 2026 cs.AI

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Jiayi Ji

Citations: 983

h-index: 16

Xiaoshuai Sun

Citations: 1,235

h-index: 20

Rongrong Ji

Citations: 1,027

h-index: 17

Rui Zhao

Citations: 7

h-index: 1

Yidong Chen

Citations: 116

h-index: 5

Wujin Sun

Citations: 0

h-index: 0

Qianzhi Chen

Citations: 4

h-index: 1

Yang Zhang

Citations: 5

h-index: 2

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

0 Citations

0 Influential

10 Altmetric

50.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!