2603.14523v1 Mar 15, 2026 cs.CV

VLA-Thinker: Improving Vision-Language-Action Model Performance through Image-Based Reasoning

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Yunhao Ge

Citations: 784

h-index: 10

Sicheng Gao

Citations: 8

h-index: 2

Bingxin Xu

Citations: 284

h-index: 4

Yuzhang Shang

Citations: 4

h-index: 1

Chaoyang Wang

Citations: 18

h-index: 3

Wenrui Bao

Citations: 5

h-index: 2

Yu Tian

Citations: 54

h-index: 4

Y. Rawat

Citations: 2,143

h-index: 22

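Vision-Language-Action (VLA) models have shown promise for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning and treat visual input as static context. This limits a model's ability to actively re-examine its environment and resolve ambiguities during long-horizon tasks. In this work, we propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline: (1) a supervised fine-tuning (SFT) cold-start phase that uses curated visual chain-of-thought data to activate structured reasoning and tool-use behaviors, and (2) reinforcement learning based on GRPO (Group Relative Policy Optimization) that aligns complete reasoning-action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks show that VLA-Thinker substantially improves manipulation performance, reaching a 97.5% success rate on LIBERO with considerable gains across long-horizon robotic tasks. Project and code: https://cywang735.github.io/VLA-Thinker/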

Original Abstract

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: https://cywang735.github.io/VLA-Thinker/ .
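
The second training stage scores whole reasoning-action trajectories by task-level success. As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same task. The binary success reward, the group size, and the function name `group_relative_advantages` are assumptions made for illustration; the abstract does not specify the paper's actual reward design or normalization details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper: rollouts that beat the group average get a
    positive advantage, rollouts below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout has the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: four rollouts of one task, reward 1.0 on success, 0.0 on failure.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every rollout succeeds carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Under these assumptions, successful rollouts receive positive advantages and failed ones negative, while a group with identical outcomes yields zero advantage for every rollout.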

0 Citations

0 Influential

11 Altmetric

55.0 Score

