2605.30011v1 May 28, 2026 cs.CV

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Zheqi Lv
Wenqiao Zhang
Yang Dai
Siliang Tang
Ming Gao
Jiaqi Zhu
Zhiqi Ge
Zixuan Wan
Yueting Zhuang
Yuqian Yuan
Binhe Yu
Haoyuan Zheng

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
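The speedup quoted for BridgeData V2 follows directly from the two step latencies in the abstract. The short Python sketch below reproduces that arithmetic; the function and constant names are illustrative and do not come from the paper, and the steps-per-second figure assumes control steps run strictly one after another with no other overhead.

```python
def speedup(baseline_s: float, faster_s: float) -> float:
    """Return how many times faster `faster_s` is than `baseline_s`."""
    if baseline_s <= 0 or faster_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / faster_s


# Step latencies on BridgeData V2, as reported in the abstract (seconds).
ECOT_STEP_LATENCY_S = 8.377
VISUALTHINK_STEP_LATENCY_S = 0.367

ratio = speedup(ECOT_STEP_LATENCY_S, VISUALTHINK_STEP_LATENCY_S)

# Assumption: one policy step at a time, so rate = 1 / latency.
steps_per_second = 1.0 / VISUALTHINK_STEP_LATENCY_S

print(f"speedup: {ratio:.1f}x")
print(f"steps per second: {steps_per_second:.2f}")
```

Under that sequential-step assumption, 0.367 s per step corresponds to a little under three policy steps per second, compared with roughly one step every eight seconds for the ECoT figure.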
