2603.13966v1 Mar 14, 2026 cs.AI

vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models

Dieter Fox (Citations: 984; h-index: 12)

Ranjay Krishna (Citations: 3,312; h-index: 22)

Chris Dongjoo Kim (Citations: 82; h-index: 3)

Suhwan Choi (Citations: 29; h-index: 3)

YUN-OO Lee (Citations: 3; h-index: 1)

Yubeen Park (Citations: 9; h-index: 2)

Youngjae Yu (Citations: 6; h-index: 2)

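Vision-Language-Action (VLA) models are usually evaluated with benchmark scripts that each model repository maintains independently, which leads to duplicated code, dependency conflicts, and underspecified protocols. This paper introduces vla-eval, an open-source evaluation framework that separates model inference from benchmark execution through Docker-based environment isolation. A model is integrated once by implementing a single `predict()` method, and a benchmark is integrated once through a four-method interface; the full cross-evaluation matrix then works automatically. A complete evaluation takes only two commands: `vla eval serve` and `vla eval run`. The framework supports 13 simulation benchmarks and six model servers. Parallel evaluation through episode sharding and batch inference yields a 47x throughput improvement, completing 2000 LIBERO episodes in about 18 minutes. Using this infrastructure, the authors run a reproducibility audit of a published VLA model on three benchmarks. All three closely match the published values, but the audit also uncovers undocumented requirements, ambiguous termination semantics, and hidden normalization statistics that can affect results. The authors additionally release a VLA leaderboard that aggregates 657 published results across 17 benchmarks. The framework, evaluation configs, and all reproduction results are publicly available.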
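To put the throughput figures in perspective, the short calculation below derives the parallel episode rate and the sequential runtime they would imply. It assumes the 47x factor is measured against a sequential run of the same 2000-episode workload, which the abstract does not state explicitly, so the derived numbers are rough estimates rather than reported results.

```python
# Figures quoted in the abstract.
episodes = 2000
parallel_minutes = 18.0   # "about 18 minutes"
speedup = 47.0            # "47x throughput improvement"

# Throughput of the parallel run, in episodes per minute.
parallel_eps_per_min = episodes / parallel_minutes

# ASSUMPTION: the 47x factor is relative to a sequential run of the
# same workload; the abstract does not say this explicitly.
sequential_minutes = parallel_minutes * speedup
sequential_hours = sequential_minutes / 60.0

print(f"parallel throughput: {parallel_eps_per_min:.1f} episodes/min")
print(f"implied sequential time: {sequential_hours:.1f} hours")
```

Under that assumption, the same evaluation would take roughly 14 hours when run sequentially.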

Original Abstract

Vision-Language-Action (VLA) models are typically evaluated using per-benchmark scripts maintained independently by each model repository, leading to duplicated code, dependency conflicts, and underspecified protocols. We present vla-eval, an open-source evaluation harness that decouples model inference from benchmark execution through a WebSocket+msgpack protocol with Docker-based environment isolation. Models integrate once by implementing a single predict() method; benchmarks integrate once via a four-method interface; the full cross-evaluation matrix works automatically. A complete evaluation requires only two commands: vla eval serve and vla eval run. The framework supports 13 simulation benchmarks and six model servers. Parallel evaluation via episode sharding and batch inference achieves a 47x throughput improvement, completing 2000 LIBERO episodes in about 18 minutes. Using this infrastructure, we conduct a reproducibility audit of a published VLA model across three benchmarks, finding that all three closely reproduce published values while uncovering undocumented requirements, ambiguous termination semantics, and hidden normalization statistics that can silently distort results. We additionally release a VLA leaderboard aggregating 657 published results across 17 benchmarks. Framework, evaluation configs, and all reproduction results are publicly available.
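The decoupling described in the abstract can be illustrated with a minimal in-process sketch. Only the `predict()` method name and the count of four benchmark methods come from the abstract. The method names `reset`, `step`, `is_done`, and `result`, all signatures, the toy classes, and the round-robin sharding are hypothetical and are not taken from the vla-eval code base. The sketch also omits the WebSocket+msgpack transport and Docker isolation and runs the shards one after another.

```python
from typing import Any, Dict, List, Protocol


class Model(Protocol):
    # The abstract names a single predict() method; this signature is assumed.
    def predict(self, observation: Dict[str, Any]) -> List[float]: ...


class Benchmark(Protocol):
    # HYPOTHETICAL method names; the abstract only says "four-method interface".
    def reset(self, episode_id: int) -> Dict[str, Any]: ...
    def step(self, action: List[float]) -> Dict[str, Any]: ...
    def is_done(self) -> bool: ...
    def result(self) -> Dict[str, Any]: ...


class ZeroModel:
    """Toy model that always returns a zero action vector."""

    def predict(self, observation: Dict[str, Any]) -> List[float]:
        return [0.0] * 7


class CountdownBenchmark:
    """Toy benchmark whose episodes end after a fixed number of steps."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.steps = 0
        self.episode_id = -1

    def reset(self, episode_id: int) -> Dict[str, Any]:
        self.episode_id = episode_id
        self.steps = 0
        return {"instruction": "noop", "step": 0}

    def step(self, action: List[float]) -> Dict[str, Any]:
        self.steps += 1
        return {"instruction": "noop", "step": self.steps}

    def is_done(self) -> bool:
        return self.steps >= self.horizon

    def result(self) -> Dict[str, Any]:
        return {"episode_id": self.episode_id, "steps": self.steps}


def shard_episodes(num_episodes: int, num_shards: int) -> List[List[int]]:
    """Split episode ids across shards in round-robin order."""
    return [list(range(s, num_episodes, num_shards)) for s in range(num_shards)]


def run_shard(model: Model, benchmark: Benchmark, episode_ids: List[int]) -> List[Dict[str, Any]]:
    """Run the given episodes; the loop depends only on the two interfaces."""
    results = []
    for episode_id in episode_ids:
        obs = benchmark.reset(episode_id)
        while not benchmark.is_done():
            obs = benchmark.step(model.predict(obs))
        results.append(benchmark.result())
    return results


shards = shard_episodes(10, 4)
all_results = [
    r for ids in shards for r in run_shard(ZeroModel(), CountdownBenchmark(), ids)
]
print(shards)
print(len(all_results))
```

Because `run_shard` touches only the two interfaces, any model can in principle be paired with any benchmark, which is the property the abstract calls the cross-evaluation matrix. In a real deployment each shard would run in its own worker rather than sequentially.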

3 Citations

0 Influential

11 Altmetric

58.0 Score
