2606.05979v1 Jun 04, 2026 cs.RO

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Pengfei Liu (Citations: 528, h-index: 7)
Siqi Kou (Citations: 293, h-index: 9)
Zhijie Wei (Citations: 0, h-index: 0)
Xiaowu Xia (Citations: 1, h-index: 1)
Zhijie Deng (Citations: 88, h-index: 4)
Yi Yang (Citations: 721, h-index: 11)
Zhihong Liu (Citations: 70, h-index: 1)
Yiyang Chen (Citations: 8, h-index: 1)
Yanzhe Hu (Citations: 8, h-index: 2)
Jianbo Zhou (Citations: 2, h-index: 1)
Bo Zhao (Citations: 1, h-index: 1)
Xueqi Li (Citations: 15, h-index: 2)

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as inputs and jointly predicts textual subtasks, subgoal images, and robot actions. It thereby combines the world modeling interface of world-action models (WAMs), which allows learning from extensive egocentric videos, with the language reasoning capacity of vision-language-action (VLA) models, which helps solve complex long-horizon tasks. At the core of WLA lies an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, which predicts the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective through a dedicated World Expert, and they are used to make the state-action correlation easier for the Action Expert to characterize. WLA uses meta-queries so that world prediction influences action generation only implicitly, which allows world prediction to be disabled during inference. World prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
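The input/output interface described in the abstract, with its optional world-prediction branch, can be illustrated with a toy sketch. This is not the WLA-0 implementation: the names `Observation`, `Prediction`, `ToyWLA`, and the `with_world` flag are hypothetical, and the arithmetic inside each method is a placeholder standing in for the AR backbone, the World Expert, and the Action Expert. It only shows the control flow in which both experts read the same shared features, so the subgoal image can be skipped at inference without changing the action. The test-time scaling mode is not sketched because the abstract does not say how it works.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    instruction: str            # textual instruction
    image: List[List[float]]    # placeholder image as rows of pixel values
    robot_state: List[float]    # e.g. joint positions


@dataclass
class Prediction:
    subtask: str                                    # textual subtask
    action: List[float]                             # robot action
    subgoal_image: Optional[List[List[float]]] = None  # only if world branch runs


class ToyWLA:
    """Hypothetical stand-in for a WLA-style model; all math is placeholder."""

    def _backbone(self, obs: Observation) -> Tuple[str, List[float]]:
        # Placeholder for the AR backbone: emits a subtask string and shared
        # features (standing in for what the meta-queries would carry).
        subtask = f"next step for: {obs.instruction}"
        features = [sum(row) / len(row) for row in obs.image]
        return subtask, features

    def _world_expert(self, obs: Observation,
                      features: List[float]) -> List[List[float]]:
        # Placeholder subgoal image: each row filled with its feature value.
        return [[f for _ in row] for f, row in zip(features, obs.image)]

    def _action_expert(self, obs: Observation,
                       features: List[float]) -> List[float]:
        # Placeholder action: robot state shifted by the mean feature.
        mean = sum(features) / len(features)
        return [s + mean for s in obs.robot_state]

    def predict(self, obs: Observation, with_world: bool = False) -> Prediction:
        subtask, features = self._backbone(obs)
        action = self._action_expert(obs, features)
        # The world branch is optional: the action depends only on the
        # shared features, not on the decoded subgoal image.
        subgoal = self._world_expert(obs, features) if with_world else None
        return Prediction(subtask=subtask, action=action, subgoal_image=subgoal)


if __name__ == "__main__":
    obs = Observation("stack the blocks", [[0.0, 1.0], [1.0, 1.0]], [0.0, 0.5])
    model = ToyWLA()
    print(model.predict(obs))                   # world branch disabled
    print(model.predict(obs, with_world=True))  # world branch enabled
```

In this sketch the action is identical in both modes, which mirrors the abstract's statement that world prediction can be disabled during inference.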

