2603.14371v1 Mar 15, 2026 cs.RO

OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

Xin Ding

Xiangyu Li

Institute for AI Industry Research (AIR), Tsinghua University

Weijun Wang

Ting Cao

Yunxin Liu

Huaizhi Tang

Abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for $π_{0.5}$, the most popular MoT VLA, and evaluate under representative robotic configurations. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
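The two optimizations named in the abstract can be illustrated with a toy sketch. This is not the OxyGen implementation: the names `SharedKVCache`, `get_or_prefill`, and `run_control_cycles` are hypothetical, strings stand in for real key/value tensors, and no model or tensor batching is involved. It only shows the control flow under two assumptions: the prefill for an observation frame runs once and is reused by both tasks, and language decoding is limited to a per-cycle token budget so that every control cycle still emits exactly one action.

```python
from dataclasses import dataclass, field


@dataclass
class SharedKVCache:
    """Toy stand-in for a KV cache keyed by observation frame (hypothetical)."""
    entries: dict = field(default_factory=dict)
    prefill_calls: int = 0

    def get_or_prefill(self, frame_id, observation):
        # Prefill runs at most once per frame; later tasks reuse the stored entry.
        if frame_id not in self.entries:
            self.prefill_calls += 1
            self.entries[frame_id] = f"kv({observation})"  # placeholder for K/V tensors
        return self.entries[frame_id]


def run_control_cycles(num_frames, tokens_per_cycle, reply_length):
    """Per cycle: one action from the fresh frame plus a bounded slice of decoding."""
    cache = SharedKVCache()
    actions, reply = [], []
    for frame_id in range(num_frames):
        observation = f"obs{frame_id}"
        # Action task: triggers the single prefill for this frame.
        kv_for_action = cache.get_or_prefill(frame_id, observation)
        actions.append(f"action<{kv_for_action}>")
        # Language task: same frame, so the cached entry is reused (no second prefill).
        cache.get_or_prefill(frame_id, observation)
        # Decode at most tokens_per_cycle tokens, then yield to the next control cycle.
        start = len(reply)
        budget = min(tokens_per_cycle, reply_length - start)
        reply.extend(f"tok{start + i}" for i in range(budget))
    return cache.prefill_calls, actions, reply


if __name__ == "__main__":
    calls, actions, reply = run_control_cycles(num_frames=3, tokens_per_cycle=4, reply_length=10)
    print(calls, len(actions), len(reply))
```

In this sketch a 10-token reply spreads over three cycles (4, 4, then 2 tokens) while each cycle still produces one action, and the prefill count equals the number of frames rather than frames times tasks.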
