2605.30011v1 May 28, 2026 cs.CV

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Zheqi Lv

Wenqiao Zhang

Yang Dai

Siliang Tang

Ming Gao

Jiaqi Zhu

Zhiqi Ge

Zixuan Wan

Yueting Zhuang

Yuqian Yuan

Binhe Yu

Haoyuan Zheng

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Its guiding principle is to steer action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA learns the visual evidence tokens with a tailored selective routing mechanism, which enables low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions used for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
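The latency figures quoted above can be checked with a few lines of arithmetic. The sketch below recomputes the speedup from the two reported step latencies and, as an illustration only, converts each latency into a closed-loop control rate. The helper names `speedup` and `control_rate_hz` are hypothetical, and the rate conversion assumes one policy call per control step with no action chunking or pipelining, which the abstract does not state.

```python
def speedup(baseline_latency_s: float, new_latency_s: float) -> float:
    """Ratio of baseline step latency to new step latency."""
    if baseline_latency_s <= 0 or new_latency_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency_s / new_latency_s


def control_rate_hz(step_latency_s: float) -> float:
    """Control rate under the ASSUMPTION of one policy call per step,
    with no action chunking or pipelining (not stated in the abstract)."""
    if step_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / step_latency_s


# Step latencies on BridgeData V2 as reported in the abstract.
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367

if __name__ == "__main__":
    ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
    print(f"speedup: {ratio:.1f}x")
    print(f"ECoT rate: {control_rate_hz(ECOT_LATENCY_S):.2f} Hz")
    print(f"VisualThink-VLA rate: {control_rate_hz(VISUALTHINK_LATENCY_S):.2f} Hz")
```

Under that assumption, the reported latencies correspond to roughly one action every eight seconds for ECoT versus between two and three actions per second for VISUALTHINK-VLA, which is what "sub-second regime" means in closed-loop terms.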
