2605.06192v1 May 07, 2026 cs.CV

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

Cong Huang, Kai Chen, Lizhe Qi, Zhaoyang Yang, Yurun Jin

Abstract

Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
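
The abstract does not say how the Structured Kinematic-to-Visual Action Fields are constructed. As a rough illustration of projecting a kinematic state into the target camera view, the sketch below assumes a standard pinhole camera with known intrinsics and extrinsics, and renders a single end-effector keypoint as a Gaussian heatmap. The function names, the heatmap encoding, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def project_point(p_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    R (3x3 nested lists) and t (length 3) map world to camera coordinates:
    p_cam = R @ p_world + t. Returns None if the point is behind the camera.
    """
    x, y, z = (
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def render_action_field(uv, width, height, sigma=2.0):
    """Rasterize one projected keypoint as a Gaussian heatmap (height x width).

    This encoding is an assumption made for illustration only.
    """
    u0, v0 = uv
    return [
        [
            math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2 * sigma ** 2))
            for u in range(width)
        ]
        for v in range(height)
    ]


# Hypothetical setup: identity rotation, world origin 1 m in front of the camera.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1.0]

# Hypothetical end-effector position in world coordinates (meters).
end_effector = [0.1, -0.05, 0.0]

uv = project_point(end_effector, R, t, fx=100.0, fy=100.0, cx=32.0, cy=24.0)
field = render_action_field(uv, width=64, height=48)
```

A per-frame map like `field` is spatially aligned with the video frames, which is the intuition behind conditioning on image-space fields rather than low-dimensional action tokens. The actual field design in EA-WM may differ substantially from this sketch.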
