2603.15386v1 Mar 16, 2026 cs.CV

RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Yongliang Wang

Fernando Ropero

Erkin Turkoz

Antonio Ruiz

Bielefeld University

Yanfeng Zhang

Lu Liu

Mingwei Sun

Jun Du

Daniel Matos

Huawei Technologies

Original Abstract

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale fine-tuning on spatial question answering, which inherently couples perception and reasoning. In this paper, we investigate whether decoupling perception from reasoning improves spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than having the model ingest videos directly, we represent each scene as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. Our results on the static split of VSI-Bench provide an upper bound on spatial reasoning performance under ideal perceptual conditions. This bound is significantly higher than the performance reported in previous works, by up to 16%, without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% and 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
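
To make the idea of "structured geometric tools" over a 3D scene graph more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the tool API, so every name here (`SceneObject`, `SceneGraphTools`, `get_dimensions`, `get_distance`, `closest_to`) is hypothetical, and objects are assumed to be axis-aligned boxes with centroids given in metres.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneObject:
    """One node of a toy 3D scene graph (assumed: axis-aligned box, metres)."""
    label: str
    center: Vec3  # (x, y, z) centroid
    size: Vec3    # (width, depth, height)


class SceneGraphTools:
    """Hypothetical tool layer an LLM agent could call instead of reading pixels."""

    def __init__(self, objects: Iterable[SceneObject]) -> None:
        self._objects = {obj.label: obj for obj in objects}

    def get_dimensions(self, label: str) -> Vec3:
        """Return the stored (width, depth, height) of one object."""
        return self._objects[label].size

    def get_distance(self, a: str, b: str) -> float:
        """Euclidean distance between the centroids of two objects."""
        return math.dist(self._objects[a].center, self._objects[b].center)

    def closest_to(self, anchor: str, candidates: Iterable[str]) -> str:
        """Return the candidate whose centroid is nearest to the anchor."""
        return min(candidates, key=lambda c: self.get_distance(anchor, c))


# A tiny hand-written scene standing in for ground-truth annotations.
scene = SceneGraphTools([
    SceneObject("sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8)),
    SceneObject("table", (3.0, 4.0, 0.4), (1.2, 0.6, 0.45)),
    SceneObject("lamp", (1.0, 0.0, 0.4), (0.3, 0.3, 1.5)),
])

print(scene.get_distance("sofa", "table"))
print(scene.closest_to("sofa", ["table", "lamp"]))
```

In a setup like this, a question such as "which object is closest to the sofa?" is answered by a deterministic geometric query, so the language model only has to choose the right tool and arguments rather than estimate distances from video frames.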
