2605.07288v1 May 08, 2026 cs.CV

Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

Yongjian Guo

Citations: 268

h-index: 2

Sheng Wen

Citations: 3

h-index: 1

Junwu Xiong

Citations: 244

h-index: 2

Wenxuan Huang

Citations: 102

h-index: 6

Jiaxuan Gao

Citations: 8

h-index: 2

Zhong Guan

Citations: 6

h-index: 1

Wanlun Ma

Citations: 779

h-index: 12

Xinquan Xiao

Citations: 43

h-index: 2

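The integration of Vision-Language-Action (VLA) models with world models is attracting growing attention. One representative approach treats a learned world model as a generative simulator and performs policy optimization within "imagination." However, when used as simulators for specific environments such as the LIBERO benchmark, existing world models often suffer from poor generalization and long-horizon error accumulation. These models are highly sensitive to small changes in the initial state; subtle variations in color, illumination, and other visual factors can trigger cascading hallucinations that lead to severe blurriness or overexposure. In addition, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of world models as simulators. To address them, we propose Sword, a robust world model framework. Our method uses Structure-Guided Style Augmentation to disentangle the visual textures of the interactive environment from task-relevant dynamics, thereby improving generalization. It also uses Dynamic Latent Bootstrapping to maintain consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method substantially outperforms the baseline WoVR in generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.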

Original Abstract

The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.

1 Citation

0 Influential

6 Altmetric

31.0 Score
