2603.01465v1 Mar 02, 2026 cs.RO

Long-Horizon Robot Manipulation with Keyframe Chaining: Accounting for Non-Markovian Properties

Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

H. Shen

Citations: 35

h-index: 2

Wentao Tan

Citations: 178

h-index: 5

Lei Zhu

Citations: 14

h-index: 2

Fengling Li

Citations: 472

h-index: 12

Jingjing Li

Citations: 13

h-index: 1

Guoli Yang

Citations: 42

h-index: 2

Yipeng Chen

Citations: 15

h-index: 2

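Existing vision-language-action (VLA) models rely heavily on immediate observations and therefore often struggle to scale to long-horizon tasks. Recent work has tried to handle procedural tasks by introducing retrieval mechanisms or extending the context window, but these approaches still struggle to capture non-Markovian dependencies, where the optimal action depends only on specific past states rather than on the current observation. To address this, we propose Keyframe-Chaining VLA, a framework that extracts and links important past frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space and so identifies distinct state transitions effectively. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves past frames with high temporal relevance to the current execution phase. The selected keyframes are integrated into the VLA as visual tokens, explicitly grounding the policy in long-horizon temporal context. Finally, we measure task success rates on four non-Markovian manipulation tasks built on the ManiSkill simulator. Experiments show that the proposed method achieves superior performance and effectively handles robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.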

Original Abstract

Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.
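
The abstract describes two steps, selecting keyframes at distinct state transitions in an embedding space and then retrieving the stored keyframes most relevant to the current phase. The sketch below illustrates that general idea only; it is not the authors' implementation. The function names, the cosine-distance threshold rule, and the use of a plain query vector in place of the learned progress-aware query are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (1.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_keyframes(embeddings, threshold=0.5):
    """Hypothetical selector: keep a frame when its embedding has moved
    more than `threshold` away from the most recently kept keyframe.
    A learned selector would replace this fixed-threshold rule."""
    if not embeddings:
        return []
    keyframes = [0]  # always keep the first frame as the initial state
    for t in range(1, len(embeddings)):
        if cosine_distance(embeddings[keyframes[-1]], embeddings[t]) > threshold:
            keyframes.append(t)
    return keyframes


def retrieve_keyframes(keyframes, embeddings, query, k=2):
    """Hypothetical retrieval: rank stored keyframes by closeness to a
    query vector (standing in for a progress-aware query) and return the
    top-k indices in temporal order, ready to be chained as context."""
    ranked = sorted(keyframes, key=lambda i: cosine_distance(embeddings[i], query))
    return sorted(ranked[:k])
```

With per-frame embeddings from any visual encoder, `select_keyframes` would run once per episode step to grow the keyframe memory, and `retrieve_keyframes` would pick which stored frames to pass to the policy alongside the current observation. Returning the indices in temporal order is a design choice here so that the retrieved frames keep their original sequence.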

0 Citations

0 Influential

26 Altmetric

130.0 Score

