2605.30280v1 May 28, 2026 cs.RO

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Zixing Lei, Jian Guan, Tong Zhang, Mingsheng Li, Junyang Lin, Zhixuan Liang, Yiyang Huang (Northeastern University), Ji-lu Ye, Shuai Bai, Yuchong Sun, Sicheng Xie, Dayiheng Liu, Xuhong Huang, Yitao Liu, Junhao Chen, Yingming Zheng, Qiuyue Wang, Xintong Hu, Pei Lin, Jiazhao Zhang, Haoqi Yuan, G. Zhou, Hang Yin, Yebin Wang, Wujian Peng, Delin Chen, Jingyang Fan, Xianwei Zhuang, Xinyu Zhou, Haoyang Li, An-Jen Chen, Xuejing Liu, Rui Chen, Chenxu Lu, Tao Yu, Xiong-hui Chen, Jie Zhang, Jing Zhou, Zhao Li, Zhibo Yang

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
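
As a rough illustration of the embodiment-aware prompt conditioning described above, the following minimal sketch prepends a robot-specific textual description to the task instruction. The abstract does not give the actual prompt format, so the `Embodiment` fields, the `build_prompt` helper, the prompt wording, and the example action dimensions are all assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot platform (not taken from the paper)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prepend a textual embodiment description to the task instruction.

    The wording of the prompt is an assumption; only the idea of describing
    the embodiment and control convention in text comes from the abstract.
    """
    return (
        f"Robot: {embodiment.name}. "
        f"Action space: {embodiment.action_dim}-D, {embodiment.control_convention}. "
        f"Task: {instruction}"
    )


# Illustrative values only; the paper's real descriptions may differ.
widowx = Embodiment("WidowX single arm", 7, "end-effector delta pose plus gripper")
aloha = Embodiment("ALOHA bimanual", 14, "absolute joint positions")

prompt_a = build_prompt(widowx, "put the spoon on the towel")
prompt_b = build_prompt(aloha, "put the spoon on the towel")
print(prompt_a)
print(prompt_b)
```

The same instruction yields a different conditioning string for each platform, which is the property that would let a single model choose the matching control convention when decoding actions.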
