2603.18795v1 Mar 19, 2026 cs.CV

Perceptio: Improving the Spatial Perception of Vision Language Models via Spatial Token Generation

Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Amanmeet Garg (Citations: 831, h-index: 13)
Garin Kessler (Citations: 8, h-index: 2)
Shalini Chaudhuri (Citations: 1, h-index: 1)
Rui Zhao (Citations: 7, h-index: 1)
Yuchen Li (Citations: 1, h-index: 1)

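Large vision language models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, because the model must infer complex geometry implicitly. This paper proposes Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled by explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Specifically, the authors (i) distill a VQ-VAE depth codebook from a strong monocular teacher model to tokenize depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so that the model first generates spatial tokens and then answers. To stabilize depth token generation, they introduce a new composite depth-token objective (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. A multi-task co-training strategy across diverse datasets lets the model learn perception tokens for a range of downstream tasks. Built on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: cIoU on referring expression segmentation improves by +0.8/+1.4/+1.1 on RefCOCO/+/g, spatial understanding accuracy on HardBLINK rises by 10.3%, and MMBench accuracy improves by 1.0%. These results indicate that explicit spatial chain-of-thought substantially strengthens spatial grounding in LVLMs.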
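The summary mentions tokenizing depth with a VQ-VAE codebook and a soft-merging step for differentiable reconstruction, but does not spell out either. As a rough illustration only, the sketch below shows the generic vector-quantization lookup (nearest codebook entry per feature) and one plausible reading of soft merging (a softmax-weighted mix of codebook vectors). The function names, the toy codebook, and the temperature parameter are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sq_dists(features, codebook):
    """Squared Euclidean distance from each feature (N, D) to each code (K, D)."""
    diff = features[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=-1)

def quantize(features, codebook):
    """Hard assignment: token id of the nearest codebook vector per feature."""
    return sq_dists(features, codebook).argmin(axis=1)

def soft_merge(logits, codebook, temperature=1.0):
    """Soft assignment (assumed form): softmax-weighted mix of codebook vectors.

    Unlike argmax, this is smooth in the logits, so a reconstruction
    loss could in principle propagate gradients back through it.
    """
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p @ codebook

# Toy codebook (hypothetical): 8 codes in 4 dimensions, code k = [k, k, k, k].
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Toy "depth features": slightly perturbed copies of codes 3, 0 and 5.
features = codebook[[3, 0, 5]] + 0.1

tokens = quantize(features, codebook)

# Using negative distances as logits, a low temperature approaches the hard
# lookup, while a very high temperature approaches the codebook mean.
logits = -sq_dists(features, codebook)
sharp = soft_merge(logits, codebook, temperature=0.01)
flat = soft_merge(logits, codebook, temperature=1e9)
```

In this toy setup `tokens` recovers the ids 3, 0 and 5, and `sharp` is numerically close to `codebook[tokens]`. A real system would learn the codebook and produce the logits from the language model rather than from distances.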

Original Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks, improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
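The abstract reports segmentation gains in cIoU without defining the metric. In referring expression segmentation, cIoU usually denotes cumulative IoU: total intersection over total union across all samples, rather than the mean of per-sample IoU. The sketch below assumes that convention; the function name and toy masks are illustrative and do not come from the paper.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: summed intersection divided by summed union."""
    inter = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter += int(np.logical_and(pred, gt).sum())
        union += int(np.logical_or(pred, gt).sum())
    # Guard against an empty dataset or all-empty masks.
    return inter / union if union > 0 else 0.0

# Two toy 2x2 samples: a small object that is half right, a large one that is exact.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]

ciou = cumulative_iou(preds, gts)  # (1 + 4) / (2 + 4) = 5/6
```

Here the per-sample IoUs are 0.5 and 1.0 (mean 0.75), while the cumulative value is 5/6. Because pixels are pooled across samples, larger objects weigh more heavily under this metric.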

1 Citation

0 Influential

6.5 Altmetric

33.5 Score
