2603.23149v1 Mar 24, 2026 cs.AI

Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Stéphane Lathuilière

Citations: 553

h-index: 9

Massimiliano Pappa

Citations: 6

h-index: 1

Luca Romani

Citations: 26

h-index: 2

Valentino Sacco

Citations: 4

h-index: 1

Alessio Palma

Citations: 19

h-index: 2

Fabio Galasso

Citations: 129

h-index: 5

Xavier Alameda-Pineda

Citations: 1

h-index: 1

I. Spinelli

Citations: 0

h-index: 0

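Deploying safety-critical agents requires predicting the consequences of actions before they are executed. World models offer a paradigm for this kind of proactive foresight, but current approaches built on visual simulation often incur latencies of several seconds or more per step. In this work, we challenge the assumption that visual processing is essential for failure prevention. We show that a trained policy's latent state, together with its planned actions, already contains enough information to predict action outcomes, so visual simulation is redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation: a privileged Vision Language Model teacher annotates offline trajectories, and a Large Language Model student conditioned on the latent state learns to predict semantic outcomes. This yields a text-only inference path that bypasses heavy visual generation entirely and achieves a 14x speedup over baselines. Experiments on MetaWorld and LIBERO show that DILLO produces high-fidelity descriptions of the next state and can steer the policy, improving episode success rate by up to 15 pp on some tasks and by 9.3 pp on average.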

Original Abstract

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.
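
The "describe-then-act" loop outlined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`describe_then_act`, `make_toy_policy`, `toy_describe`, `toy_is_failure`), the retry-on-predicted-failure rule, and the toy stand-ins for the policy and the language-action world model are all assumptions made for illustration. The abstract does not specify how DILLO's descriptions are turned into steering decisions. The sketch only shows the general shape of the idea: a textual outcome is predicted from the latent state and the planned action, with no visual simulation, and the action is executed only if that description does not signal a failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Latent = Sequence[float]
Action = Sequence[float]


@dataclass
class SteeringResult:
    action: Action      # action finally selected
    description: str    # predicted textual outcome for that action
    attempts: int       # number of proposals that were examined
    accepted: bool      # False if every proposal was predicted to fail


def describe_then_act(
    latent: Latent,
    propose_action: Callable[[Latent], Action],
    describe_outcome: Callable[[Latent, Action], str],
    is_failure: Callable[[str], bool],
    max_attempts: int = 4,
) -> SteeringResult:
    """Hypothetical steering loop: describe the outcome in text, then act.

    All callables are placeholders. In the paper's setting, describe_outcome
    would be played by the distilled language-action world model.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    action: Action = ()
    description = ""
    for attempt in range(1, max_attempts + 1):
        action = propose_action(latent)
        # Text-only prediction from (latent state, planned action).
        description = describe_outcome(latent, action)
        if not is_failure(description):
            return SteeringResult(action, description, attempt, True)
    # No proposal passed the check; return the last one, flagged as rejected.
    return SteeringResult(action, description, max_attempts, False)


# --- Toy stand-ins (assumptions, for demonstration only) ---

def make_toy_policy(candidates: List[Action]) -> Callable[[Latent], Action]:
    """Return a 'policy' that replays a fixed list of candidate actions."""
    it = iter(candidates)
    return lambda latent: next(it)


def toy_describe(latent: Latent, action: Action) -> str:
    """Pretend world model: the grasp succeeds if the action is near the latent."""
    if abs(action[0] - latent[0]) > 0.1:
        return "the gripper misses the object"
    return "the gripper grasps the object"


def toy_is_failure(description: str) -> bool:
    """Pretend failure check: a simple keyword match on the description."""
    return "misses" in description
```

With the toy components, `describe_then_act([0.5], make_toy_policy([[0.9], [0.5]]), toy_describe, toy_is_failure)` rejects the first proposal and accepts the second. In a real system the keyword check would be replaced by whatever failure criterion the steering layer uses, which the abstract leaves unspecified.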
