2602.15397v1 Feb 17, 2026 cs.RO

ActionCodec: What Makes for Good Action Tokenizers

Zibin Dong (Citations: 260, h-index: 7)
Yicheng Liu (Citations: 548, h-index: 5)
Shiduo Zhang (Citations: 263, h-index: 6)
Baijun Ye (Citations: 114, h-index: 5)
Yifu Yuan (Citations: 1,138, h-index: 11)
Fei Ni (Citations: 371, h-index: 11)
Xipeng Qiu (Citations: 11, h-index: 2)
Hang Zhao (Citations: 154, h-index: 6)
Yinchuan Li (Citations: 71, h-index: 4)
Jianye Hao (Citations: 313, h-index: 8)
Jingjing Gong (Citations: 88, h-index: 4)

Abstract

Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
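The abstract names its design principles without defining them, so the following is only an illustrative sketch of how two of them might be probed on a stream of discrete action tokens. Both function names and both metrics are assumptions made here for illustration, not the paper's definitions: normalized usage entropy stands in for "vocabulary redundancy" (values near 1.0 suggest few wasted codes), and position-wise agreement between two token sequences stands in for "temporal token overlap".

```python
import math
from collections import Counter


def vocabulary_usage_entropy(tokens, vocab_size):
    """Hypothetical proxy for vocabulary redundancy.

    Returns the entropy of the empirical token distribution divided by
    log(vocab_size), so the result lies in [0, 1]. A value of 1.0 means
    every code in the vocabulary is used equally often.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(vocab_size)


def temporal_token_overlap(prev_tokens, next_tokens):
    """Hypothetical proxy for temporal token overlap.

    Returns the fraction of positions at which two equally long token
    sequences (e.g. the encodings of consecutive action chunks) agree.
    """
    if len(prev_tokens) != len(next_tokens):
        raise ValueError("sequences must have the same length")
    if not prev_tokens:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(prev_tokens, next_tokens))
    return matches / len(prev_tokens)


if __name__ == "__main__":
    # Toy token streams, invented for this example.
    print(vocabulary_usage_entropy([0, 1, 2, 3, 0, 1, 2, 3], vocab_size=4))
    print(temporal_token_overlap([5, 7, 7, 2], [5, 7, 9, 2]))
```

Under these assumed proxies, a tokenizer following the stated principles would score high on both: a high usage entropy and a high agreement between the token sequences of consecutive chunks.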
