2602.05789v1 Feb 05, 2026 cs.CV

Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation

Hengyi Wang

Citations: 18

h-index: 3

Ruiqiang Zhang

Citations: 3

h-index: 1

Chang Liu

Citations: 236

h-index: 5

Guanjie Wang

Citations: 45

h-index: 3

Zehua Ma

Citations: 944

h-index: 15

Han Fang

Citations: 1,398

h-index: 18

Weiming Zhang

Citations: 18

h-index: 3

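As tasks that rely on spatial information, such as Vision-Language Navigation/Action, grow in importance, interest in the allocentric perception capabilities of Vision-Language Models (VLMs) is increasing. However, VLMs remain brittle on allocentric queries that require an explicit perspective shift, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. We therefore propose Allocentric Perceiver, a training-free method. It uses off-the-shelf geometric experts to recover metric 3D state from one or more images, then instantiates a query-conditioned allocentric reference frame aligned with the semantic intent of the instruction. Allocentric Perceiver deterministically transforms the recovered geometry into the target frame and prompts the backbone VLM with a structured, geometry-grounded representation, shifting the work from implicit reasoning to explicit computation. Applied to several VLM backbone families and evaluated on spatial reasoning benchmarks, it yields consistent and substantial gains (about 10%) on allocentric tasks while maintaining strong egocentric performance. It also outperforms both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.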

Original Abstract

With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
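The deterministic frame change described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`to_allocentric`, `describe`), the OpenCV-style camera convention (x right, y down, z forward), and the assumption of a level camera (so world "up" is `(0, -1, 0)` in camera coordinates) are all hypothetical choices made for illustration. The sketch builds a frame anchored at a target object, using that object's facing direction, and re-expresses another point in it.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def scale(a: Vec, s: float) -> Vec:
    return (a[0] * s, a[1] * s, a[2] * s)


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def to_allocentric(point: Vec, origin: Vec, facing: Vec,
                   up: Vec = (0.0, -1.0, 0.0)) -> Vec:
    """Express `point` (camera coordinates) in a frame centred on `origin`.

    Returns (right, up, forward) coordinates relative to the target.
    Assumes `up` is a unit vector and the camera is level (hypothetical
    convention: x right, y down, z forward).
    """
    # Project the facing direction onto the ground plane and normalise it.
    flat = sub(facing, scale(up, dot(facing, up)))
    norm = math.sqrt(dot(flat, flat))
    if norm < 1e-9:
        raise ValueError("facing direction is parallel to the up axis")
    forward = scale(flat, 1.0 / norm)
    # With this convention, forward x up gives the target's right-hand side.
    right = cross(forward, up)
    d = sub(point, origin)
    return (dot(d, right), dot(d, up), dot(d, forward))


def describe(coords: Vec, eps: float = 1e-6) -> str:
    """Turn frame coordinates into a coarse relation usable in a text prompt."""
    x, _, z = coords
    lateral = "right" if x > eps else "left" if x < -eps else "aligned"
    depth = "front" if z > eps else "behind" if z < -eps else "beside"
    return f"{lateral}, {depth}"


# A person stands 4 m ahead of the camera and faces it; a cup is at (1, 0, 3).
person, person_facing, cup = (0.0, 0.0, 4.0), (0.0, 0.0, -1.0), (1.0, 0.0, 3.0)
ego = describe(to_allocentric(cup, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
allo = describe(to_allocentric(cup, person, person_facing))
print("camera view:", ego)
print("person view:", allo)
```

In this toy scene the lateral relation flips between the camera's frame and the person's frame, which is the kind of perspective shift the abstract says VLMs handle poorly when left to reason implicitly. Feeding the computed relation to the model as structured text is one plausible reading of "geometry-grounded representations"; the paper's actual prompt format is not given here.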

0 Citations

0 Influential

9 Altmetric

45.0 Score
