2603.04910v1 Mar 05, 2026 cs.RO

VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo

Abstract

Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
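The memory layout described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses weight-free dot-product attention in place of a trained Transformer, and the names `ToyMemoryContext`, `attend`, `cache_size`, and the 50/50 blending rule are hypothetical choices made only for illustration. It shows just the bookkeeping: a sliding window of recent observation tokens, plus a fixed number of memory tokens that are updated recursively whenever an observation leaves the window, so the context handed to the policy stays bounded regardless of episode length.

```python
import math
from collections import deque


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys):
    """Weight-free single-head dot-product attention; keys double as values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return outputs


class ToyMemoryContext:
    """Hypothetical stand-in: sliding-window working memory plus fixed-size episodic memory."""

    def __init__(self, window_size, num_memory_tokens, dim, cache_size=4):
        self.window = deque()                          # recent observation tokens
        self.window_size = window_size
        self.evicted_cache = deque(maxlen=cache_size)  # bounded cache of out-of-window tokens (assumption)
        # Fixed initial memory tokens; in a trained model these would be learned.
        self.memory = [[1.0 if d == i % dim else 0.0 for d in range(dim)]
                       for i in range(num_memory_tokens)]

    def step(self, obs_token):
        self.window.append(list(obs_token))
        if len(self.window) > self.window_size:
            self.evicted_cache.append(self.window.popleft())
            self._compress()
        # Context for the policy: episodic memory tokens followed by working memory.
        return self.memory + list(self.window)

    def _compress(self):
        mixed = attend(self.memory, self.memory)        # self-attention over summary tokens
        read = attend(mixed, list(self.evicted_cache))  # cross-attention over evicted observations
        self.memory = [[0.5 * m + 0.5 * r for m, r in zip(m_tok, r_tok)]
                       for m_tok, r_tok in zip(mixed, read)]


ctx = ToyMemoryContext(window_size=3, num_memory_tokens=2, dim=4)
for t in range(50):
    context = ctx.step([math.sin(t + d) for d in range(4)])
print(len(context))  # 5: two memory tokens plus three window tokens
```

After 50 steps the context still holds only five tokens, which is the sense in which per-step memory and computation stay nearly constant. In the paper the compressor is trained jointly with a diffusion policy; nothing here is learned.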
