2603.23481v1 Mar 24, 2026 cs.RO

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Chuan Wen

Citations: 194

h-index: 4

Xinzhuo Li

Citations: 25

h-index: 2

Ismini Lourentzou

Citations: 1,363

h-index: 19

Wendi Chen

Citations: 130

h-index: 3

Cewu Lu

Citations: 322

h-index: 7

Weigang Yi

Citations: 19

h-index: 2

Yuchen Mo

Citations: 26

h-index: 2

Xiangyu Zeng

Citations: 6

h-index: 1

K. Driggs-Campbell

Citations: 2,264

h-index: 26

Haoran Yuan

Citations: 6

h-index: 1

Zhenyu Zhang

Citations: 9

h-index: 1

Jiashi Yin

Citations: 1

h-index: 1

Original Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the π0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
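The abstract does not give the form of the tactile regularization loss, so the following is only an illustrative sketch of what "balanced cross-modal attention" could mean in practice, not the authors' formulation. It assumes a single query attending over a mixed sequence of visual and tactile tokens, and it penalizes the shortfall of attention mass on tactile tokens below a target share. The function names, the `target_share` parameter, and the squared-shortfall form are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tactile_attention_mass(scores, is_tactile):
    """Fraction of one query's attention that lands on tactile tokens."""
    weights = softmax(scores)
    return sum(w for w, tactile in zip(weights, is_tactile) if tactile)


def balance_penalty(scores, is_tactile, target_share=0.5):
    """Hypothetical regularizer (an assumption, not the paper's loss):
    squared shortfall of tactile attention mass below target_share.
    It is zero once tactile tokens receive at least target_share."""
    mass = tactile_attention_mass(scores, is_tactile)
    shortfall = max(0.0, target_share - mass)
    return shortfall ** 2


if __name__ == "__main__":
    # Two visual tokens followed by two tactile tokens.
    is_tactile = [False, False, True, True]

    # Vision-dominated attention: tactile tokens are nearly ignored.
    print(balance_penalty([4.0, 4.0, 0.0, 0.0], is_tactile))

    # Evenly spread attention: no penalty.
    print(balance_penalty([1.0, 1.0, 1.0, 1.0], is_tactile))
```

In a training setup, a term like this would typically be averaged over queries, heads, and layers and added to the action loss with a small weight. A one-sided penalty is used here so that the model is never discouraged from attending more to touch during contact; whether VTAM uses a one-sided or symmetric term is not stated in the abstract.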

1 Citations

0 Influential

13 Altmetric

66.0 Score
