2601.07518v1 Jan 12, 2026 cs.CV

Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization

Zhening Liu (citations: 366, h-index: 9)
Fangyu Lin (citations: 0, h-index: 0)
Yingdong Hu (citations: 66, h-index: 5)
Yufan Zhuang (citations: 6, h-index: 1)
Zehong Lin (citations: 719, h-index: 14)
Jun Zhang (citations: 104, h-index: 7)

Original Abstract

Immersive telepresence aims to transform human interaction in AR/VR applications by enabling lifelike full-body holographic representations for enhanced remote collaboration. However, existing systems rely on hardware-intensive multi-camera setups and demand high bandwidth for volumetric streaming, limiting their real-time performance on mobile devices. To overcome these challenges, we propose Mon3tr, a novel Monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling into telepresence for the first time. Mon3tr adopts an amortized computation strategy, dividing the process into a one-time offline multi-view reconstruction phase to build a user-specific avatar and a monocular online inference phase during live telepresence sessions. A single monocular RGB camera is used to capture body motions and facial expressions in real time to drive the 3DGS-based parametric human model, significantly reducing system complexity and cost. The extracted motion and appearance features are transmitted at < 0.2 Mbps over WebRTC's data channel, allowing robust adaptation to network fluctuations. On the receiver side, e.g., Meta Quest 3, we develop a lightweight 3DGS attribute deformation network to dynamically generate corrective 3DGS attribute adjustments on the pre-built avatar, synthesizing photorealistic motion and appearance at ~ 60 FPS. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving a PSNR of > 28 dB for novel poses, an end-to-end latency of ~ 80 ms, and > 1000x bandwidth reduction compared to point-cloud streaming, while supporting real-time operation from monocular inputs across diverse scenarios. Our demos can be found at https://mon3tr3d.github.io.
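The bandwidth figures quoted in the abstract can be put in perspective with a short back-of-the-envelope calculation. The sketch below is illustrative only: it assumes one feature packet per rendered frame (the abstract does not state the send rate), treats 0.2 Mbps as a ceiling and 60 FPS as a nominal rate, and uses hypothetical helper names that do not come from the paper.

```python
# Back-of-the-envelope budget using only numbers quoted in the abstract.
# Assumption (not stated in the abstract): one feature packet per rendered frame.

BITRATE_BPS = 0.2e6       # ceiling on the transmitted rate (< 0.2 Mbps)
FRAME_RATE = 60           # nominal receiver frame rate (~60 FPS)
REDUCTION_FACTOR = 1000   # quoted reduction vs. point-cloud streaming (> 1000x)


def bytes_per_frame(bitrate_bps: float, fps: float) -> float:
    """Payload bytes available per frame at the given bitrate and frame rate."""
    return bitrate_bps / fps / 8


def scaled_baseline_mbps(bitrate_bps: float, factor: float) -> float:
    """Bitrate, in Mbps, of a stream `factor` times larger than `bitrate_bps`."""
    return bitrate_bps * factor / 1e6


if __name__ == "__main__":
    budget = bytes_per_frame(BITRATE_BPS, FRAME_RATE)
    baseline = scaled_baseline_mbps(BITRATE_BPS, REDUCTION_FACTOR)
    print(f"per-frame budget at the 0.2 Mbps ceiling: {budget:.0f} bytes")
    print(f"stream 1000x larger than 0.2 Mbps: {baseline:.0f} Mbps")
```

Under these assumptions, each frame's motion and appearance features would have to fit in roughly 417 bytes, and a stream 1000 times larger than the 0.2 Mbps ceiling would run at 200 Mbps. Because the abstract gives only bounds, these numbers indicate scale rather than measured values.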
