2605.30280v1 May 28, 2026 cs.RO

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Zixing Lei, Jian Guan, Tong Zhang, Mingsheng Li, Junyang Lin, Zhixuan Liang, Yiyang Huang (Northeastern University), Ji-lu Ye, Shuai Bai, Yuchong Sun, Sicheng Xie, Dayiheng Liu, Xuhong Huang, Yitao Liu, Junhao Chen, Yingming Zheng, Qiuyue Wang, Xintong Hu, Pei Lin, Jiazhao Zhang, Haoqi Yuan, G. Zhou, Hang Yin, Yebin Wang, Wujian Peng, Delin Chen, Jingyang Fan, Xianwei Zhuang, Xinyu Zhou, Haoyang Li, An-Jen Chen, Xuejing Liu, Rui Chen, Chenxu Lu, Tao Yu, Xiong-hui Chen, Jie Zhang, Jing Zhou, Zhao Li, Zhibo Yang

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
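
The embodiment-aware prompt conditioning mentioned in the abstract can be pictured with a small sketch. This is only an illustration of the general idea (a textual robot description prepended to the task instruction), not the paper's actual prompt format: the `Embodiment` class, the `build_prompt` function, the field names, and the example action dimensions and control conventions are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical container for a robot description (not from the paper)."""
    name: str
    action_dim: int          # illustrative value, assumed for this sketch
    control_convention: str  # illustrative wording, assumed for this sketch


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prepend a textual embodiment description to the task instruction."""
    header = (
        f"Robot: {embodiment.name}. "
        f"Action dimension: {embodiment.action_dim}. "
        f"Control: {embodiment.control_convention}."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Two example embodiments; the numbers and conventions are assumptions.
widowx = Embodiment("WidowX", 7, "end-effector delta pose plus gripper")
aloha = Embodiment("ALOHA", 14, "absolute joint positions for both arms")

if __name__ == "__main__":
    print(build_prompt(widowx, "put the spoon on the towel"))
    print(build_prompt(aloha, "put the spoon on the towel"))
```

Under this assumed format, the same instruction yields a different conditioning string for each robot, which is the property the abstract attributes to embodiment-aware prompting: a single model can be told which embodiment and control convention its output actions should follow.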

