2603.21341v1 Mar 22, 2026 cs.AI

RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

Seungku Kim (citations: 0, h-index: 0)

Dongyoung Kim (citations: 112, h-index: 7)

Sumin Park (citations: 1, h-index: 1)

Woomin Song (citations: 75, h-index: 4)

Taeyoung Kim (citations: 32, h-index: 2)

Huiwon Jang (citations: 700, h-index: 8)

Jinwoo Shin (citations: 47, h-index: 2)

Jaehyung Kim (citations: 0, h-index: 0)

Younggyo Seo (citations: 1,849, h-index: 20)

Abstract

Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them, so that multimodal understanding translates readily into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through visual-question-answering-style supervision. However, these approaches have been reported to yield unstable VLA performance, often with only marginal or even negative gains. In this paper, we propose RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and to refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after supervised fine-tuning (SFT) using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
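The abstract does not specify the reward definition or the RL algorithm, so the following is only a toy illustration of what "refining reasoning to improve action accuracy" could look like. It assumes a per-token exact-match reward against a reference action and a simple mean-baseline advantage over a group of sampled completions. The function names, the sample completions, and the token values are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def action_accuracy(predicted, reference):
    """Fraction of reference positions where the predicted action token matches.

    Assumption: actions are discretized into integer tokens and scored by
    exact match. The paper's actual reward may differ.
    """
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def group_advantages(rewards):
    """Center rewards on the group mean (assumed baseline, not from the paper)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical data: three sampled completions for one observation and instruction.
# Each completion pairs free-form reasoning text with the action tokens it ends in.
reference_actions = [12, 40, 7, 255]
samples = [
    ("Move above the mug, then close the gripper.", [12, 40, 7, 255]),
    ("Move toward the mug.", [12, 40, 9, 0]),
    ("Rotate the wrist first.", [3, 41, 9, 0]),
]

rewards = [action_accuracy(actions, reference_actions) for _, actions in samples]
advantages = group_advantages(rewards)

for (reasoning, _), reward, advantage in zip(samples, rewards, advantages):
    print(f"{reward:.2f} {advantage:+.2f} {reasoning}")
```

In a policy-gradient update, completions with positive advantage would have their reasoning and action tokens made more likely, and those with negative advantage less likely. This is the sense in which the reward on actions can shape the preceding reasoning.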
