2602.22716v1 Feb 26, 2026 cs.CV

SoPE: Improving the Spatial Awareness of 3D Multimodal Models through Spherical Coordinate-Based Positional Embedding

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

Jianmin Ji (Citations: 454, h-index: 12)

Guanting Ye (Citations: 232, h-index: 7)

Qiyan Zhao (Citations: 7, h-index: 1)

Wenhao Yu (Citations: 9, h-index: 1)

Xiaofeng Zhang (Citations: 9, h-index: 2)

Yanyong Zhang (Citations: 141, h-index: 7)

Ka-Veng Yuen (Citations: 84, h-index: 3)

Mingkai Li (Citations: 287, h-index: 6)

Qing Jiang (Citations: 1,537, h-index: 10)

Li Yuan (Citations: 2,339, h-index: 7)

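3D multimodal models (3D LVLMs) built on large language models (LLMs) have made considerable progress across a variety of multimodal tasks. However, Rotary Position Embedding (RoPE), the position-dependent modeling scheme these models inherit, is still not optimal for 3D multimodal understanding. Vanilla RoPE does not preserve the essential three-dimensional spatial structure when encoding 3D tokens, and it ignores angular dependencies when computing relative distances, which impairs the model's ability to capture directional changes in visual representations. To overcome these limitations, we propose Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a three-dimensional spherical coordinate space so that spatial position and directional angle are modeled jointly. This preserves the intrinsic geometric structure of point-cloud data, improves spatial awareness, and provides a more consistent and expressive geometric representation for multimodal learning. We also introduce a multi-scale frequency mixing strategy that fuses feature information from different frequency domains. Experiments on several 3D scene benchmarks validate the effectiveness of our approach, and real-world deployment experiments further demonstrate its strong generalization ability.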

Original Abstract

3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
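The abstract does not give SoPE's equations, so the following is only an illustrative sketch of the general idea it describes: convert a 3D position to spherical coordinates (radius, polar angle, azimuth) and use each component as the position for a standard RoPE-style pair rotation. The function names (`to_spherical`, `rotate_pairs`, `spherical_rope`), the equal three-way split of the feature vector, and the `base=10000.0` frequency schedule are all assumptions made for illustration and are not taken from the paper. The multi-scale frequency mixing strategy is not sketched here because the abstract gives no detail on it.

```python
import math


def to_spherical(x, y, z):
    """Convert Cartesian (x, y, z) to (r, theta, phi).

    r is the radius, theta the polar angle measured from +z,
    phi the azimuth in the xy-plane.
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0.0 else 0.0
    phi = math.atan2(y, x)
    return r, theta, phi


def rotate_pairs(vec, position, base=10000.0):
    """Standard RoPE rotation: rotate consecutive pairs of vec by
    position * frequency, with geometrically decaying frequencies."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def spherical_rope(vec, point):
    """Hypothetical layout (an assumption, not the paper's): split vec
    into three equal chunks and rotate them by r, theta, and phi."""
    r, theta, phi = to_spherical(*point)
    chunk = len(vec) // 3
    if chunk * 3 != len(vec) or chunk % 2 != 0:
        raise ValueError("len(vec) must be a multiple of 6")
    out = []
    for k, pos in enumerate((r, theta, phi)):
        out.extend(rotate_pairs(vec[k * chunk:(k + 1) * chunk], pos))
    return out


# Usage: embed a 12-dimensional query for a token located at (1, 2, 2).
query = [0.1 * i for i in range(1, 13)]
embedded = spherical_rope(query, (1.0, 2.0, 2.0))
```

Because each chunk is rotated rather than scaled, the vector norm is unchanged, and the dot product between two rotated chunks depends only on the difference of their positions along that component, which is the relative-position property RoPE is known for. How SoPE actually allocates dimensions and frequencies across the radial and angular components may differ from this sketch.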
