2604.00813v1 Apr 01, 2026 cs.CV

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Sicheng Zuo (Citations: 206, h-index: 6)
Jiwen Lu (Citations: 1,272, h-index: 17)
Wenzhao Zheng (Citations: 21, h-index: 3)
Hanbing Li (Citations: 51, h-index: 3)
Zixun Xie (Citations: 105, h-index: 5)
Shaoqing Xu (Citations: 382, h-index: 10)
Fang Li (Citations: 86, h-index: 6)
Long Chen (Citations: 248, h-index: 6)
Zhi-Xin Yang (Citations: 102, h-index: 6)

Original Abstract

End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
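The streaming idea in the abstract (temporal causal attention over cached historical features, restricted to a sliding window) can be illustrated with a small toy. This is only a sketch under stated assumptions, not the authors' implementation: the class name `StreamingCausalAttention`, the `window` parameter, the use of one feature vector per frame, and the absence of learned query/key/value projections are all hypothetical simplifications chosen to keep the example in pure Python.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class StreamingCausalAttention:
    """Toy single-head attention over a sliding window of cached frame features.

    Hypothetical illustration only: one feature vector per frame, no learned
    projections, and `window` is the number of past frames kept in the cache.
    """

    def __init__(self, window):
        # deque(maxlen=...) evicts the oldest entry automatically once full,
        # which gives the sliding-window behaviour.
        self.cache = deque(maxlen=window)

    def step(self, feature):
        """Process one incoming frame and return its attended feature."""
        # Causality: the current frame attends to cached past frames and to
        # itself. Future frames do not exist yet in a streaming setting.
        context = list(self.cache) + [feature]
        scale = math.sqrt(len(feature))
        scores = [
            sum(q * k for q, k in zip(feature, key)) / scale for key in context
        ]
        weights = softmax(scores)
        output = [
            sum(w * value[d] for w, value in zip(weights, context))
            for d in range(len(feature))
        ]
        # Store the current feature so later frames can reuse it instead of
        # recomputing it from the raw input.
        self.cache.append(feature)
        return output


# Usage: feed frames one at a time; the cache never exceeds the window size.
attn = StreamingCausalAttention(window=2)
for frame in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]):
    attended = attn.step(frame)
print(len(attn.cache))
```

In this toy, the per-frame cost depends on the window size rather than on the total number of frames seen so far, which is the property a sliding-window cache is meant to provide. A real model would cache per-layer keys and values for many tokens per frame rather than a single vector.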

5 Citations

1 Influential

8.5 Altmetric

49.5 Score
