2606.05979v1 Jun 04, 2026 cs.RO

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Pengfei Liu, Siqi Kou, Zhijie Wei, Xiaowu Xia, Zhijie Deng, Yi Yang, Zhihong Liu, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Bo Zhao, Xueqi Li

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as inputs and jointly predicts textual subtasks, subgoal images, and robot actions. It combines the world modeling interface of world-action models (WAMs), which allows learning from extensive egocentric videos, with the language reasoning capacity of vision-language-action (VLA) models, which helps solve complex long-horizon tasks. At the core of WLA lies an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective through a dedicated World Expert and are used to ease the characterization of the state-action correlation for the Action Expert. WLA uses meta-queries so that the world prediction influences action generation only implicitly, which allows the world prediction to be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
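
The inference-time control flow described above can be sketched roughly as follows. This is a toy illustration, not the WLA-0 implementation: the names `NextState`, `backbone_step`, `action_expert`, `world_expert`, and `act`, the stand-in arithmetic, and in particular the best-of-N candidate selection used for test-time scaling are all assumptions made for illustration. The abstract only states that the world prediction can be disabled at inference or activated for test-time scaling; it does not say how candidates are generated or scored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import random


@dataclass
class NextState:
    subtask: str            # semantic-level textual intention
    dynamics: List[float]   # fine-grained dynamics latent (stand-in for meta-query outputs)


def backbone_step(instruction: str, image: List[float], robot_state: List[float]) -> NextState:
    """Hypothetical stand-in for the AR backbone: emits a subtask and a dynamics latent."""
    subtask = f"next step for: {instruction}"
    dynamics = [0.5 * (i + s) for i, s in zip(image, robot_state)]
    return NextState(subtask, dynamics)


def action_expert(state: NextState, robot_state: List[float], rng: random.Random) -> List[float]:
    """Hypothetical Action Expert: maps the dynamics latent to a noisy action."""
    return [d - s + rng.gauss(0.0, 0.1) for d, s in zip(state.dynamics, robot_state)]


def world_expert(state: NextState) -> List[float]:
    """Hypothetical World Expert: decodes the dynamics latent into a toy 'subgoal image'."""
    return [2.0 * d for d in state.dynamics]


def act(
    instruction: str,
    image: List[float],
    robot_state: List[float],
    use_world: bool = False,
    num_candidates: int = 4,
    seed: int = 0,
) -> Tuple[str, List[float], Optional[List[float]]]:
    rng = random.Random(seed)
    state = backbone_step(instruction, image, robot_state)

    if not use_world:
        # Fast path: the World Expert is never called; the action depends on
        # the world prediction only implicitly, through the dynamics latent.
        return state.subtask, action_expert(state, robot_state, rng), None

    # Assumed test-time scaling: sample several actions and keep the one whose
    # toy predicted outcome is closest to the decoded subgoal.
    subgoal = world_expert(state)
    candidates = [action_expert(state, robot_state, rng) for _ in range(num_candidates)]

    def score(action: List[float]) -> float:
        outcome = [2.0 * (s + a) for s, a in zip(robot_state, action)]
        return sum((o - g) ** 2 for o, g in zip(outcome, subgoal))

    return state.subtask, min(candidates, key=score), subgoal
```

Calling `act(..., use_world=False)` returns a subtask and an action without decoding any subgoal, while `use_world=True` spends extra compute on decoding and candidate scoring. In this sketch that is the trade-off between the fast path and test-time scaling.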
