2605.07474v1 May 08, 2026 cs.CV

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

J. Lyu (Citations: 0, h-index: 0)
Yuhao Zhou (Citations: 477, h-index: 9)
Yun Zhu (Citations: 21, h-index: 3)
Yang Zhou (Citations: 83, h-index: 4)
Jian Lan (Citations: 33, h-index: 4)
Zhangyuan Wang (Citations: 58, h-index: 4)
Dan Si (Citations: 29, h-index: 2)
Thomas Seidl (Citations: 46, h-index: 3)
Qing Ye (Citations: 346, h-index: 8)
Jiancheng Lyu (Citations: 12, h-index: 1)

Original Abstract

Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.
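The two mechanical steps named in the abstract, recovering a language label for each vision-action pair and combining client updates on the server, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the classifier architecture, the contents of the instruction set, or how the aggregation weights are adapted, so `INSTRUCTIONS`, `toy_scores`, `label_pair`, `weighted_average`, and the fixed weights below are all hypothetical stand-ins. The contrastive planning loss is not sketched because the abstract gives no details about it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical instruction set; the abstract only says it is "predefined".
INSTRUCTIONS = ["pick up the block", "open the drawer", "push the button"]


@dataclass
class Triplet:
    vision: Sequence[float]
    language: str
    action: Sequence[float]


def label_pair(
    vision: Sequence[float],
    action: Sequence[float],
    score_fn: Callable[[Sequence[float], Sequence[float]], List[float]],
) -> Triplet:
    """Attach the highest-scoring instruction to a vision-action pair."""
    scores = score_fn(vision, action)  # one score per entry in INSTRUCTIONS
    best = max(range(len(INSTRUCTIONS)), key=lambda i: scores[i])
    return Triplet(vision, INSTRUCTIONS[best], action)


def weighted_average(
    client_params: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Server-side weighted average of flat client parameter vectors.

    Assumption: a plain normalized weighted mean. How ForgeVLA chooses or
    adapts the weights is not described in the abstract.
    """
    total = sum(weights)
    dim = len(client_params[0])
    return [
        sum(w * p[k] for p, w in zip(client_params, weights)) / total
        for k in range(dim)
    ]


def toy_scores(vision: Sequence[float], action: Sequence[float]) -> List[float]:
    # Stand-in for a learned classifier: deterministic scores for the demo only.
    return [sum(vision), sum(action), 0.0]


# Client side: turn an unlabeled pair into a full triplet.
triplet = label_pair([0.1, 0.2], [1.0, 0.5], toy_scores)

# Server side: combine two clients' parameters with hypothetical weights.
global_params = weighted_average([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
```

In this toy run the pair is labeled with the second instruction because `toy_scores` gives it the largest score, and the second client dominates the average because its weight is three times larger. In a real system the scores would come from a trained classifier and the weights from whatever adaptive rule the server applies.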

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
