2603.29844v1 Mar 31, 2026 cs.RO

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Yi Chen (Citations: 349; h-index: 9)
Hui Zhou (Citations: 1; h-index: 1)
Mingyu Ding (Citations: 159; h-index: 5)
Yuying Ge (Citations: 4,479; h-index: 28)
Yixiao Ge (Citations: 479; h-index: 12)
Xihui Liu (Citations: 548; h-index: 8)

Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.
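
The two-stage training idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architecture details, dimensions, or loss functions, so everything below is an assumption made for illustration. Both systems are replaced by small linear maps, both losses are mean squared error, and every name (`system2`, `system1`, `losses`, `action_loss_sensitivity`, `LATENT`, `ACTION`) is hypothetical. The sketch shows only one structural point: during the decoupled warmup, System-1 is conditioned on the ground-truth future latent, so the action loss is independent of System-2; during joint optimization, System-1 is conditioned on System-2's predicted latent, so the action loss becomes sensitive to System-2's weights through the bottleneck.

```python
import random

# Hypothetical toy sizes; the abstract gives no dimensions.
LATENT, ACTION = 4, 2


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-0.5, 0.5]."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def system2(w2, z_now, z_text):
    """Linear stand-in for the VLM-based System-2: predicts a future latent."""
    return matvec(w2, z_now + z_text)  # list concatenation, length 2 * LATENT


def system1(w1, z_now, z_intent):
    """Linear stand-in for System-1: maps (current, intent) latents to an action."""
    return matvec(w1, z_now + z_intent)


def losses(w1, w2, sample, stage):
    """Return (foresight_loss, action_loss) for one sample.

    stage="warmup": System-1 sees the ground-truth future latent.
    stage="joint":  System-1 sees System-2's predicted latent.
    """
    z_now, z_text, z_future, action = sample
    z_pred = system2(w2, z_now, z_text)
    intent = z_future if stage == "warmup" else z_pred
    foresight_loss = mse(z_pred, z_future)
    action_loss = mse(system1(w1, z_now, intent), action)
    return foresight_loss, action_loss


def action_loss_sensitivity(w1, w2, sample, stage, eps=1e-4):
    """Finite-difference change in the action loss when one System-2 weight moves."""
    base = losses(w1, w2, sample, stage)[1]
    bumped = [row[:] for row in w2]
    bumped[0][0] += eps
    return (losses(w1, bumped, sample, stage)[1] - base) / eps


rng = random.Random(0)
w2 = rand_matrix(LATENT, 2 * LATENT, rng)  # System-2 weights
w1 = rand_matrix(ACTION, 2 * LATENT, rng)  # System-1 weights
sample = (
    [rng.uniform(-1, 1) for _ in range(LATENT)],  # current observation latent
    [rng.uniform(-1, 1) for _ in range(LATENT)],  # instruction latent
    [rng.uniform(-1, 1) for _ in range(LATENT)],  # ground-truth future latent
    [rng.uniform(-1, 1) for _ in range(ACTION)],  # demonstrated action
)

warmup_grad = action_loss_sensitivity(w1, w2, sample, "warmup")
joint_grad = action_loss_sensitivity(w1, w2, sample, "joint")
print("warmup:", warmup_grad, "joint:", joint_grad)
```

In this toy setting the warmup sensitivity is exactly zero, because the action loss never touches `w2`, while the joint-stage sensitivity is nonzero. That mirrors, in miniature, the abstract's description of action-aware gradients reaching the VLM backbone only once the two systems are optimized end to end.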

Citations: 0; Influential: 0; Altmetric: 14; Score: 70.0
