2602.07082v1 Feb 06, 2026 cs.CV

MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation

Haoming Wang

Citations: 40

h-index: 3

Qiyao Xue

Citations: 99

h-index: 3

Weichen Liu

Citations: 20

h-index: 2

Wei Gao

Citations: 37

h-index: 3

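As embodied AI expands beyond traditional object detection and recognition to more advanced tasks such as robot manipulation and actuation planning, visual spatial reasoning over video input becomes essential for perceiving spatial relationships between objects and guiding device actions. However, existing vision-language models (VLMs) reason poorly about space because they lack knowledge of 3D spatial information, and this limitation is most pronounced in tasks that involve complex spatial relations spanning multiple video frames. This paper presents MosaicThinker, a new inference-time computing technique for on-device embodied AI. It strengthens the spatial reasoning of resource-constrained on-device VLMs so that they can handle difficult cross-frame reasoning tasks. The basic idea is to integrate the fragmented spatial information obtained from multiple frames into a unified spatial representation, a global semantic map, and then to use a visual prompt to guide the VLM to reason over that map. Experimental results show that the technique substantially improves cross-frame spatial reasoning accuracy on resource-constrained embodied AI devices across reasoning tasks of diverse types and complexities.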

Original Abstract

As embodied AI expands from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning from video inputs is necessary to perceive the spatial relationships of objects and guide device actions. However, existing visual language models (VLMs) have very weak spatial reasoning capabilities due to the lack of knowledge about 3D spatial information, especially when the reasoning tasks involve complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances the on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation, a global semantic map, and further guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks of diverse types and complexities.
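
The abstract does not spell out how the global semantic map is built, so the following is only a minimal sketch of the general idea of fusing per-frame observations into one shared map. It assumes, hypothetically, that each frame comes with a known 2D camera pose `(x, y, yaw)` and object positions in the camera's own coordinates, and that a label identifies the same object across frames. The names `to_world`, `build_semantic_map`, and `distance` are illustrative and are not taken from the paper.

```python
import math
from collections import defaultdict


def to_world(pose, point):
    """Map a 2D point from camera coordinates to world coordinates.

    pose  = (x, y, yaw): assumed camera position and heading in the world.
    point = (px, py): object position in the camera's own frame.
    """
    x, y, yaw = pose
    px, py = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * px - s * py, y + s * px + c * py)


def build_semantic_map(frames):
    """Fuse per-frame observations into one map: label -> world position.

    Assumption: the same label in different frames is the same object,
    so repeated observations are simply averaged.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for frame in frames:
        for label, point in frame["objects"].items():
            wx, wy = to_world(frame["pose"], point)
            acc = sums[label]
            acc[0] += wx
            acc[1] += wy
            acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def distance(semantic_map, a, b):
    """Euclidean distance between two objects in the fused map."""
    (ax, ay), (bx, by) = semantic_map[a], semantic_map[b]
    return math.hypot(ax - bx, ay - by)


# Two frames that never show both objects at once (made-up values).
frames = [
    {"pose": (0.0, 0.0, 0.0), "objects": {"chair": (2.0, 0.0)}},
    {"pose": (0.0, 0.0, math.pi / 2), "objects": {"table": (3.0, 0.0)}},
]
semantic_map = build_semantic_map(frames)
chair_to_table = distance(semantic_map, "chair", "table")
```

Neither frame contains both objects, yet the fused map supports a cross-frame query such as the chair-to-table distance. In the approach the abstract describes, a rendering of such a map would be given to the VLM as a visual prompt; that step is not sketched here.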

2 Citations

0 Influential

1.5 Altmetric

9.5 Score
