2605.26520v1 May 26, 2026 cs.CV

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

Wei Liu

Zhiwei Ning

Lewei Lu

Jie Yang

J. Ni

Hanming Deng

Wenwen Tong

Xiang Kong

Shengnan Ma

Ziyi Shang

Tao Hu

Yong Xien Chng

Jixuan Ying

Zehuan Wu

Yuan-Lei Zheng


While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that enhances VT-CoT capability through self-correction and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the initial cold-start stage, we synthesize a high-quality interleaved VT-CoT dataset and incorporate a reflection mechanism, equipping the model for multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.
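
The contrast between end-only supervision and a stepwise reward can be illustrated with a small sketch. The abstract does not specify how InterSketch scores individual steps, so everything here is an assumption made for illustration: the function names, the per-step scores in [0, 1], the `step_weight` parameter, and the use of an undiscounted return-to-go are hypothetical and are not the authors' method.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def end_only_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """Sparse signal: only the last step is rewarded, based on the final answer."""
    rewards = [0.0] * num_steps
    if num_steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def stepwise_rewards(
    step_scores: List[float],
    final_correct: bool,
    step_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> List[float]:
    """Dense signal: each step gets a weighted score, plus the final-answer bonus."""
    rewards = [step_weight * score for score in step_scores]
    if rewards:
        rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


if __name__ == "__main__":
    # A three-step trajectory whose final answer is wrong,
    # but whose first and third steps were judged useful (hypothetical scores).
    scores = [1.0, 0.0, 1.0]
    print(discounted_returns(end_only_rewards(len(scores), False)))
    print(discounted_returns(stepwise_rewards(scores, False)))
```

Under end-only supervision, a trajectory with a wrong final answer yields a zero signal at every step, whereas the stepwise variant still distinguishes useful intermediate steps from unhelpful ones. That is the sparsity problem the abstract refers to, shown here only in schematic form.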
