2603.01549v1 Mar 02, 2026 cs.CV

Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Sanghyeok Chu (Citations: 51, h-index: 3)
Ananya Bal (Citations: 27, h-index: 3)
Gunhee Lee (Citations: 27, h-index: 2)
Bohyung Han (Citations: 59, h-index: 4)
László A. Jeni (Citations: 2,982, h-index: 30)
Jisoo Kim (Citations: 71, h-index: 2)
J. Cho (Citations: 20, h-index: 2)
Jinhyung Kim (Citations: 19, h-index: 2)
Sihaeng Lee (Citations: 1,088, h-index: 10)
Seungryong Kim (Citations: 335, h-index: 9)
Hyunmin Lee (Citations: 75, h-index: 3)

Original Abstract

Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
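
The training-time idea in the abstract (a shared representation feeding both the action output and an auxiliary point track head, with the head dropped at inference) can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names `TinyVLA`, `PointTrackHead`, and `training_loss`, the bias-free linear layers, the mean-squared-error losses, and the `track_weight` value do not come from the paper.

```python
import random


def linear(weights, x):
    """Apply a bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init(rows, cols, rng):
    """Small random weight matrix with `rows` outputs and `cols` inputs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(pred, target):
    """Mean squared error between two equal-length flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class TinyVLA:
    """Toy stand-in for a VLA policy: observation -> features -> action."""

    def __init__(self, obs_dim, feat_dim, act_dim, seed=0):
        rng = random.Random(seed)
        self.w_feat = init(feat_dim, obs_dim, rng)
        self.w_act = init(act_dim, feat_dim, rng)

    def features(self, obs):
        return linear(self.w_feat, obs)

    def act(self, obs):
        # Inference path: the track head is never touched here.
        return linear(self.w_act, self.features(obs))


class PointTrackHead:
    """Hypothetical auxiliary head: features -> future 3D point tracks.

    Output is a flat list laid out as (timestep, point, xyz).
    """

    def __init__(self, feat_dim, num_points, horizon, seed=1):
        rng = random.Random(seed)
        self.w = init(horizon * num_points * 3, feat_dim, rng)

    def predict(self, feats):
        return linear(self.w, feats)


def training_loss(policy, track_head, obs, expert_action, gt_tracks,
                  track_weight=0.5):
    """Joint objective (assumed form): action loss + weighted track loss."""
    feats = policy.features(obs)  # shared by both heads
    action_loss = mse(linear(policy.w_act, feats), expert_action)
    track_loss = mse(track_head.predict(feats), gt_tracks)
    total = action_loss + track_weight * track_loss
    return total, action_loss, track_loss
```

In this sketch the ground-truth tracks (`gt_tracks`) are needed only by `training_loss`; at deployment one would call `policy.act(obs)` alone and discard the `PointTrackHead`, which mirrors the abstract's statement that no extra inputs, outputs, or computation are added at inference.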
