2603.18002v1 Mar 18, 2026 cs.CV

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Mahdi Rad (citations: 230, h-index: 5)

Mihai Dusmanu (citations: 1,442, h-index: 10)

Marc Pollefeys (citations: 705, h-index: 13)

Kevin Qu (citations: 149, h-index: 3)

Haozhe Qi (citations: 249, h-index: 5)

Rui Wang (citations: 74, h-index: 4)

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
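The abstract reports results on language-based localization but does not define how a predicted pose is scored. As a rough illustration only, the sketch below assumes a pose is a 2D position in meters plus a heading angle in radians, and computes a position error and an orientation error against a ground-truth pose. The pose layout and the function names `wrap_angle` and `localization_errors` are hypothetical and are not taken from the paper.

```python
import math


def wrap_angle(angle_rad):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle_rad + math.pi) % (2 * math.pi) - math.pi


def localization_errors(pred, gt):
    """Compare two poses given as (x, y, yaw) tuples.

    Assumption (not from the paper): x and y are in meters and yaw is a
    heading angle in radians.

    Returns (position_error_m, orientation_error_deg).
    """
    # Euclidean distance between predicted and ground-truth positions.
    position_error = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    # Smallest absolute angular difference, so 350 deg vs 10 deg gives 20 deg.
    orientation_error = abs(wrap_angle(pred[2] - gt[2]))
    return position_error, math.degrees(orientation_error)


if __name__ == "__main__":
    pred_pose = (3.0, 4.0, math.radians(350.0))
    gt_pose = (0.0, 0.0, math.radians(10.0))
    pos_err, ang_err = localization_errors(pred_pose, gt_pose)
    print(f"position error: {pos_err:.2f} m, orientation error: {ang_err:.1f} deg")
```

Wrapping the angular difference before taking its absolute value keeps headings on either side of the 0/360 degree boundary from being counted as nearly a full turn apart.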

Citations: 1 | Influential: 0 | Altmetric: 6.5 | Score: 33.5
