2604.11751v1 Apr 13, 2026 cs.RO

Grounded World Model for Semantically Generalizable Planning

Lang Feng (Citations: 515, h-index: 9)

Letian Wang (Citations: 1,319, h-index: 10)

Quanyi Li (Citations: 1,057, h-index: 13)

Haonan Zhang (Citations: 406, h-index: 12)

Alexandre Alahi (Citations: 4, h-index: 2)

Harold Soh (Citations: 145, h-index: 6)

Wuyang Li (Citations: 40, h-index: 3)

Original Abstract

In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image before task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored by how close its predicted outcome is to the task instruction, as reflected by the similarity of their embeddings. This approach transforms visuomotor MPC into a vision-language-action (VLA) model that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on a test set of 288 tasks that feature unseen visual signals and referring expressions yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.
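
The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoder, the world model architecture, or how action proposals are generated, so `toy_world_model`, `select_action`, and the random embeddings below are hypothetical stand-ins. The sketch assumes a single-step prediction and cosine similarity as the embedding similarity.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between each row of `a` and the vector `b`."""
    a_norm = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_norm = b / np.linalg.norm(b)
    return a_norm @ b_norm


def toy_world_model(latent, actions):
    """Hypothetical stand-in for a learned world model (assumption):
    the predicted future latent is the current latent shifted by each action."""
    return latent[None, :] + actions


def select_action(latent, instruction_embedding, proposals, world_model):
    """Score each proposal by how close its predicted outcome lies to the
    instruction embedding, then return the best proposal and all scores."""
    predicted = world_model(latent, proposals)                      # (N, D)
    scores = cosine_similarity(predicted, instruction_embedding)    # (N,)
    best = int(np.argmax(scores))
    return proposals[best], scores


# Random placeholders for quantities that would come from a
# vision-language-aligned encoder in practice (assumption).
rng = np.random.default_rng(0)
dim = 8
current_latent = rng.normal(size=dim)          # embedding of the current image
instruction_embedding = rng.normal(size=dim)   # embedding of the task instruction
proposals = rng.normal(size=(64, dim))         # 64 candidate actions

best_action, scores = select_action(
    current_latent, instruction_embedding, proposals, toy_world_model
)
print("best score:", scores.max())
```

Because the score is computed against a text embedding rather than a goal image, the same loop can be re-run with a different instruction without collecting a new goal image. A full MPC controller would typically roll the model out over a longer horizon and re-plan after each executed action, which this sketch omits.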
