2601.06474v2 Jan 10, 2026 cs.CV

SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning

Chenxu Dang (Citations: 108, h-index: 6)
Jie Wang (Citations: 46, h-index: 3)
Guang Li (Citations: 59, h-index: 3)
Zhiwen Hou (Citations: 25, h-index: 2)
Zihan You (Citations: 17, h-index: 3)
Hangjun Ye (Citations: 132, h-index: 7)
Jie Ma (Citations: 10, h-index: 2)
Long Chen (Citations: 3, h-index: 1)
Yan Wang (Citations: 26, h-index: 3)

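In autonomous driving, vision-language models (VLMs) excel at high-level reasoning, while semantic occupancy supplies fine-grained spatial detail. Each field has advanced considerably, yet no existing method effectively integrates the two paradigms. Conventional VLMs suffer from token explosion and limited spatiotemporal reasoning; semantic occupancy offers a unified spatial representation but is too dense to integrate efficiently with a VLM. To address these problems and connect VLMs with occupancy, we propose SparseOccVLA, a new vision-language-action model that uses sparse occupancy queries to unify scene understanding, occupancy forecasting, and trajectory planning. SparseOccVLA begins with a lightweight sparse occupancy encoder that produces compact yet informative sparse occupancy queries, which act as the single bridge between vision and language. These queries are aligned to the language space, and the LLM reasons over them for unified scene understanding and future occupancy forecasting. We also introduce an LLM-guided anchor-diffusion planner with decoupled anchor scoring and denoising, together with cross-model trajectory-condition fusion. SparseOccVLA achieves a 7% relative improvement in CIDEr on OmniDrive-nuScenes and a 0.5 increase in mIoU on Occ3D-nuScenes, and it reaches state-of-the-art open-loop planning performance on the nuScenes benchmark, demonstrating strong holistic capability.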

Original Abstract

In autonomous driving, Vision Language Models (VLMs) excel at high-level reasoning, whereas semantic occupancy provides fine-grained details. Despite significant progress in individual fields, there is still no method that can effectively integrate both paradigms. Conventional VLMs struggle with token explosion and limited spatiotemporal reasoning, while semantic occupancy provides a unified, explicit spatial representation but is too dense to integrate efficiently with VLMs. To address these challenges and bridge the gap between VLMs and occupancy, we propose SparseOccVLA, a novel vision-language-action model that unifies scene understanding, occupancy forecasting, and trajectory planning powered by sparse occupancy queries. Starting with a lightweight Sparse Occupancy Encoder, SparseOccVLA generates compact yet highly informative sparse occupancy queries that serve as the single bridge between vision and language. These queries are aligned into the language space and reasoned over by the LLM for unified scene understanding and future occupancy forecasting. Furthermore, we introduce an LLM-guided Anchor-Diffusion Planner featuring decoupled anchor scoring and denoising, as well as cross-model trajectory-condition fusion. SparseOccVLA achieves a 7% relative improvement in CIDEr over the state-of-the-art on OmniDrive-nuScenes, a 0.5 increase in mIoU score on Occ3D-nuScenes, and sets a new state of the art in open-loop planning metrics on the nuScenes benchmark, demonstrating its strong holistic capability.
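Since the abstract reports occupancy quality as mIoU, a minimal sketch of how mean intersection-over-union is typically computed over voxel labels may help when reading that number. This is not the authors' evaluation code: the function names `per_class_iou` and `mean_iou`, the flat-list voxel layout, and the toy labels are assumptions for illustration, and the official Occ3D-nuScenes protocol may differ (for example in which classes and which voxels are counted).

```python
import math
from typing import List, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> List[float]:
    """Return IoU for each class; a class absent from both inputs yields NaN."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1
    ious = []
    for c in range(num_classes):
        # |A union B| = |A| + |B| - |A intersect B|
        union = pred_count[c] + target_count[c] - intersection[c]
        ious.append(intersection[c] / union if union > 0 else float("nan"))
    return ious


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU over classes that appear in pred or target."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


# Toy example (hypothetical labels): six voxels, three semantic classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(per_class_iou(pred, target, 3))  # per-class IoU: 1/3, 2/3, 1/2
print(mean_iou(pred, target, 3))       # mean of the three values
```

Averaging per class rather than per voxel keeps rare classes from being swamped by frequent ones, which is why a small absolute change in mIoU can still reflect a meaningful shift on some classes.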

3 Citations

0 Influential

3.5 Altmetric

20.5 Score
