2602.05049v1 Feb 04, 2026 cs.CV

VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models

Yiye Chen (citations: 14, h-index: 3)
Yanan Jian (citations: 116, h-index: 4)
Xiaoyi Dong (citations: 15, h-index: 1)
Shuxin Cao (citations: 115, h-index: 4)
Jingyu Wu (citations: 39, h-index: 3)
Patricio Vela (citations: 4, h-index: 1)
Benjamin E. Lundell (citations: 10, h-index: 2)
Dongdong Chen (citations: 147, h-index: 5)

Abstract

Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions depend only weakly on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/
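
The two training stages named in the abstract can be illustrated with a small sketch. The abstract does not say which preference objective or which distillation loss is used, so everything below is an assumption: the first function uses the common DPO-style form, where the "chosen" action is taken to be one that follows the visual track and the "rejected" one does not, and the second uses a plain mean squared error between latent vectors. All function and parameter names (`dpo_style_loss`, `latent_distill_loss`, `stage2_loss`, `beta`, `weight`) are hypothetical and do not come from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_style_loss(logp_chosen: float, logp_rejected: float,
                   ref_logp_chosen: float, ref_logp_rejected: float,
                   beta: float = 0.1) -> float:
    """Assumed stage-1 objective: DPO-style loss for one preference pair.

    Inputs are log-probabilities of the chosen (track-following) and
    rejected actions under the trained policy and a frozen reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def latent_distill_loss(student_latent: list[float],
                        teacher_latent: list[float]) -> float:
    """Assumed distillation term: mean squared error between latents."""
    if len(student_latent) != len(teacher_latent):
        raise ValueError("latent vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_latent, teacher_latent)]
    return sum(squared) / len(squared)


def stage2_loss(sft_loss: float, student_latent: list[float],
                teacher_latent: list[float], weight: float = 1.0) -> float:
    """Assumed stage-2 objective: SFT loss plus weighted latent distillation."""
    return sft_loss + weight * latent_distill_loss(student_latent, teacher_latent)
```

In this sketch, a policy that already prefers the track-following action more strongly than the reference gets a loss below log 2, and the distillation term vanishes when the student latents match those of the stage-1 model.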
