2605.26520v1 May 26, 2026 cs.CV

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

Wei Liu, Zhiwei Ning, Lewei Lu, Jie Yang, J. Ni, Hanming Deng, Wenwen Tong, Xiang Kong, Shengnan Ma, Ziyi Shang, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Yuan-Lei Zheng

Vision-language models (VLMs) have demonstrated multi-turn visual reasoning, but their reasoning trajectories remain relatively shallow and text-centric, which limits their applicability to complex visual challenges. Human-like thought, by contrast, typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that strengthens VT-CoT capability through self-correction and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning on long-horizon visual understanding tasks. Specifically, in the cold-start stage, we synthesize a high-quality interleaved VT-CoT dataset and incorporate a reflection mechanism so that the model learns multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparse reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.
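
The abstract does not specify how the stepwise reward is computed, so the following is only an illustrative sketch of the general idea that per-step rewards give a denser learning signal than end-only supervision. Everything in it is an assumption made for illustration: the function names, the per-step scores in [0, 1], the `step_weight` parameter, and the use of undiscounted return-to-go are hypothetical and are not taken from the paper.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def end_only_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """End-only supervision: a single reward on the last step, zeros elsewhere."""
    rewards = [0.0] * num_steps
    if num_steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def stepwise_rewards(
    step_scores: List[float],
    final_correct: bool,
    step_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> List[float]:
    """Hypothetical stepwise scheme: every step earns a scaled score,
    and the final step additionally earns the outcome reward."""
    rewards = [step_weight * score for score in step_scores]
    if rewards:
        rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


# A four-step trajectory whose second step is judged poor and whose
# final answer is wrong (all values are made up for illustration).
scores = [1.0, 0.0, 1.0, 1.0]
sparse = discounted_returns(end_only_rewards(len(scores), final_correct=False))
dense = discounted_returns(stepwise_rewards(scores, final_correct=False))
print(sparse)  # [0.0, 0.0, 0.0, 0.0]
print(dense)   # [1.5, 1.0, 1.0, 0.5]
```

In this toy case the end-only scheme assigns the same zero return to every step of a failed trajectory, whereas the stepwise scheme still distinguishes the useful steps from the poor one, which is the kind of sparsity the abstract says the stepwise reward is meant to mitigate.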


