2604.11757v1 Apr 13, 2026 cs.RO

StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems

Zixuan Wang

Shu Liu

Ning Gao

Senqiao Yang

Yuxin Chen

Yilun Chen

Jinliang Zheng

AIR, Tsinghua University

Ji-lu Ye

Pengguang Chen

Jiaya Jia

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Under unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$α$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.
