2603.16666v1 Mar 17, 2026 cs.CV

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Zibin Dong (Citations: 260, h-index: 7)

Yicheng Liu (Citations: 548, h-index: 5)

Tianyuan Yuan (Citations: 544, h-index: 9)

Hang Zhao (Citations: 80, h-index: 4)

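World Action Models (WAMs) are a promising alternative to Vision-Language-Action (VLA) models: they explicitly model how visual observations change under actions and use that model for robot control. Most existing WAMs follow an imagine-then-execute paradigm, which incurs substantial test-time latency from iterative video denoising, yet it is unclear whether explicit future imagination is actually essential for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes mainly from video modeling during training. To separate the role of video modeling during training from that of explicit future generation at inference, we propose Fast-WAM, a WAM architecture that keeps video co-training during training but skips future prediction at test time. We also implement several Fast-WAM variants to compare these two factors. Experiments show that Fast-WAM remains competitive with the imagine-then-execute variants, whereas removing video co-training degrades performance far more. In practice, Fast-WAM achieves results competitive with state-of-the-art methods on simulation benchmarks (LIBERO and RoboTwin) and on real-world tasks without embodied pretraining, and it runs in real time with 190 ms latency, more than 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than in generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/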

Original Abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190 ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/
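
The abstract does not spell out the architecture, so the following is only a toy sketch of the idea it describes: a video objective is used during training, while inference runs the action path alone. It assumes a shared encoder with separate action and video heads and a weighted-sum loss. The class `ToyWAM`, its scalar "weights", the `video_weight` parameter, and the call counter are all hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyWAM:
    """Hypothetical scalar stand-in for a world action model (not the paper's code)."""
    enc: float = 0.5   # shared encoder weight
    act: float = 2.0   # action head weight
    vid: float = 1.0   # video (future observation) head weight
    video_calls: int = 0  # counts how often the video head is evaluated

    def encode(self, obs: float) -> float:
        # Shared representation used by both heads.
        return self.enc * obs

    def action_head(self, feat: float) -> float:
        return self.act * feat

    def video_head(self, feat: float) -> float:
        self.video_calls += 1
        return self.vid * feat

    def training_loss(self, obs: float, action: float, next_obs: float,
                      video_weight: float = 1.0) -> float:
        # Training: both heads contribute, so the video term also shapes
        # the shared encoder (the "video co-training" assumption).
        feat = self.encode(obs)
        action_loss = (self.action_head(feat) - action) ** 2
        video_loss = (self.video_head(feat) - next_obs) ** 2
        return action_loss + video_weight * video_loss

    def act_fast(self, obs: float) -> float:
        # Test time: encoder plus action head only; no future is generated.
        return self.action_head(self.encode(obs))


model = ToyWAM()
loss = model.training_loss(obs=2.0, action=1.0, next_obs=0.5)
calls_after_training = model.video_calls
predicted_action = model.act_fast(2.0)
print(loss, predicted_action, model.video_calls == calls_after_training)
```

In this sketch the video head is evaluated only inside `training_loss`; `act_fast` never touches it, which is the property the abstract attributes to Fast-WAM at test time. Setting `video_weight=0.0` would correspond, loosely, to the ablation that removes video co-training.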

9 Citations

4 Influential

4.5 Altmetric

39.5 Score
