2602.20200v1 Feb 22, 2026 cs.RO

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Gongwei Chen (Citations: 395, h-index: 10)

Rui Shao (Citations: 513, h-index: 12)

Zaijing Li (Citations: 150, h-index: 6)

Bin Hu (Citations: 184, h-index: 5)

D. Jiang (Citations: 142, h-index: 5)

Pengwei Xie (Citations: 2, h-index: 1)

Jianye Hao (Citations: 42, h-index: 3)

Liqiang Nie (Citations: 603, h-index: 15)

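Hierarchical Vision-Language-Action (VLA) models are rapidly becoming a dominant paradigm for robotic manipulation. Such a model typically consists of a vision-language backbone for perception and understanding and a generative policy for action generation. Its performance, however, is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: the pronounced distributional gap between isotropic noise priors and the target action distribution increases both the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition only on the current observation and ignore the constraints imposed by the history sequence, so they lack awareness of task progress and temporal consistency. To address these problems, we introduce OptimusVLA, a dual-memory VLA framework equipped with a Global Prior Memory (GPM) and a Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, which shortens the generative path and reduces the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces the temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves a 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and records a 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks first on the Generalization and Long-horizon test suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.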

Original Abstract

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves a 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains a 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.
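
The intuition behind GPM (start generation from a retrieved prior rather than from Gaussian noise, so fewer function evaluations are needed) can be illustrated with a toy sketch. This is not the OptimusVLA implementation: the memory layout, the cosine-similarity retrieval, the names `retrieve_prior` and `count_nfe`, and all sizes and constants are assumptions made for illustration, and a fixed contraction toward a known target stands in for the learned generative policy.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TASKS, DEMOS, EMB, HORIZON, ACT = 3, 4, 16, 8, 7  # hypothetical sizes

# Hypothetical memory bank: one embedding key and one action chunk per stored demo.
task_emb = np.eye(N_TASKS, EMB)                          # stand-in task embeddings
task_target = rng.normal(size=(N_TASKS, HORIZON, ACT))   # stand-in "ideal" action chunks
labels = np.repeat(np.arange(N_TASKS), DEMOS)
keys = task_emb[labels] + 0.05 * rng.normal(size=(len(labels), EMB))
chunks = task_target[labels] + 0.1 * rng.normal(size=(len(labels), HORIZON, ACT))


def retrieve_prior(query, k=2):
    """Average the k stored action chunks whose keys are most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    unit_keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    top = np.argsort(unit_keys @ q)[-k:]
    return chunks[top].mean(axis=0)


def count_nfe(x, target, alpha=0.5, tol=0.1, max_steps=100):
    """Count refinement steps until x is within tol of target.

    Each step stands in for one evaluation of a learned denoiser; here it is
    simply a fixed contraction toward the known target (an assumption).
    """
    for step in range(max_steps):
        if np.linalg.norm(x - target) < tol:
            return step
        x = x + alpha * (target - x)
    return max_steps


query = task_emb[0] + 0.05 * rng.normal(size=EMB)  # noisy embedding of task 0
target = task_target[0]

nfe_noise = count_nfe(rng.normal(size=(HORIZON, ACT)), target)  # Gaussian start
nfe_prior = count_nfe(retrieve_prior(query), target)            # retrieved start
print("NFE from Gaussian noise:", nfe_noise)
print("NFE from retrieved prior:", nfe_prior)
```

Because the retrieved chunk already lies near the target action chunk, the toy refiner needs fewer steps to reach the same tolerance. That is the qualitative effect the abstract attributes to GPM; the numbers produced here say nothing about the reported results.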
