2603.27670v1 Mar 29, 2026 cs.RO

ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation

Jiaolong Yang

Citations: 49

h-index: 3

Hongyu Yan

Citations: 6

h-index: 2

Qiwei Li

Citations: 126

h-index: 6

Yadong Mu

Citations: 36

h-index: 4

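Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness and typically rely on hand-designed heuristics to decide when a task is finished. This limitation is especially severe in long-horizon tasks that involve cascaded sub-goals. This work investigates how to estimate and integrate task progress and proposes a new model, ProgressVLA. The technical contributions are as follows. (1) Robust progress estimation: a progress estimator is pre-trained on large-scale, unsupervised video-text robotic datasets. It achieves a low prediction residual in simulation (0.07 on a scale of [0, 1]) and shows zero-shot generalization to unseen real-world samples. (2) Differentiable progress guidance: an inverse dynamics world model maps predicted action tokens to future latent visual states. These latent states are processed by the progress estimator, and a maximal progress regularization is applied, yielding a differentiable pipeline that provides progress-based guidance for refining action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, together with real-world robot deployment, show substantial improvements in success rate and generalization over strong baselines.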

Original Abstract

Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named \textbf{ProgressVLA}. Our technical contributions are twofold. (1) \emph{Robust progress estimation}: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of $[0, 1]$) in simulation and demonstrates zero-shot generalization to unseen real-world samples. (2) \emph{Differentiable progress guidance}: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
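
The guidance idea in contribution (2) can be illustrated with a toy sketch. This is not the authors' implementation: the linear `world_model`, the sigmoid `progress` estimator, the weights `W` and `v`, and the step size are all hypothetical stand-ins chosen so the gradient can be written by hand. The sketch also replaces guided diffusion sampling with plain gradient ascent on a single action vector. It only shows how a progress score that is differentiable with respect to the action can be used to nudge that action toward higher predicted progress.

```python
import math

# All functions, weights, and dimensions below are hypothetical toy stand-ins,
# not the models described in the abstract.

def world_model(state, action, W):
    """Toy world model: next latent state = state + W @ action."""
    return [s + sum(w * a for w, a in zip(row, action))
            for s, row in zip(state, W)]

def progress(latent, v):
    """Toy progress estimator: sigmoid of a linear score, so the output is in (0, 1)."""
    score = sum(vi * zi for vi, zi in zip(v, latent))
    return 1.0 / (1.0 + math.exp(-score))

def progress_gradient(state, action, W, v):
    """Analytic gradient of predicted progress with respect to the action.

    Chain rule through both toy models:
    d p / d action_j = p * (1 - p) * sum_i v_i * W[i][j]
    """
    p = progress(world_model(state, action, W), v)
    return [p * (1.0 - p) * sum(v[i] * W[i][j] for i in range(len(v)))
            for j in range(len(action))]

def refine_action(state, action, W, v, step_size=0.5, num_steps=10):
    """Gradient ascent on predicted progress (a simplification of guided sampling)."""
    action = list(action)
    for _ in range(num_steps):
        grad = progress_gradient(state, action, W, v)
        action = [a + step_size * g for a, g in zip(action, grad)]
    return action

# Usage: start from a zero action and refine it toward higher predicted progress.
state = [0.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity dynamics, an assumption for illustration
v = [1.0, -1.0]                # progress rises with latent[0], falls with latent[1]
initial_action = [0.0, 0.0]

refined_action = refine_action(state, initial_action, W, v)
before = progress(world_model(state, initial_action, W), v)
after = progress(world_model(state, refined_action, W), v)
print(f"predicted progress before: {before:.3f}, after: {after:.3f}")
```

In a learned system the hand-written gradient would normally come from automatic differentiation through the world model and the progress estimator, and the update would be applied inside the policy's sampling procedure rather than as a separate loop.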

3 Citations

0 Influential

3 Altmetric

18.0 Score
