2603.16362v1 Mar 17, 2026 cs.CV

$D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Ruizhi Wang (citations: 76, h-index: 2)

Weihan Li (citations: 2, h-index: 1)

Zunlei Feng (citations: 7, h-index: 2)

Haofei Zhang (citations: 484, h-index: 11)

Mingli Song (citations: 93, h-index: 6)

Jiayu Wang (citations: 7, h-index: 1)

Jie Song (citations: 7, h-index: 2)

Li Sun (citations: 12, h-index: 2)

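Real-time, high-fidelity monocular depth estimation is critical for a wide range of applications, but existing methods face a clear trade-off between accuracy and efficiency. Methods that perform dense prediction with Vision Transformer (ViT) backbones are fast but often produce poor visual quality. Diffusion models, by contrast, deliver high fidelity at a very high computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework for monocular depth estimation from remote sensing imagery. The framework first uses a ViT-based module to rapidly generate a high-quality initial depth map, which serves as a structural prior and replaces the time-consuming initial structure generation stage of diffusion models. Building on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine details in only a few iterations. The entire refinement process runs efficiently in a compact latent space provided by a Variational Autoencoder (VAE). Extensive experiments show that $D^3$-RSMDE achieves an 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric compared with leading models such as Marigold, while speeding up inference by more than 40$\times$ and keeping VRAM usage comparable to that of lightweight ViT models.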
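The headline figures can be read as simple ratios. The sketch below is one hedged interpretation: it assumes the LPIPS reduction is reported as a relative change against the baseline (the abstract does not say whether it is relative or absolute) and that the speedup is a ratio of per-image inference times. The function names and all numbers are hypothetical placeholders, not values from the paper.

```python
def relative_reduction_percent(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (lower LPIPS is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - ours) / baseline


def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """How many times faster `ours` is than `baseline`, from per-image latencies."""
    if ours_seconds <= 0:
        raise ValueError("ours_seconds must be positive")
    return baseline_seconds / ours_seconds


if __name__ == "__main__":
    # Hypothetical placeholder values, chosen only to exercise the formulas.
    print(relative_reduction_percent(baseline=0.20, ours=0.15))
    print(speedup(baseline_seconds=8.0, ours_seconds=0.2))
```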

Original Abstract

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although dense prediction with Vision Transformer (ViT) backbones is fast, it often yields poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40$\times$ speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.
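The control flow described in the abstract (coarse prior, encode to a latent space, a few blended refinement steps, decode) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the PLBR blending rule, so the linear weight schedule `step / num_steps` is an assumption, and `toy_encode`, `toy_decode`, and `make_toy_refiner` are hypothetical stand-ins for the VAE and the lightweight U-Net that operate on short lists of floats rather than images.

```python
from typing import Callable, List

Vector = List[float]
Refiner = Callable[[Vector, Vector], Vector]


def linear_blend(current: Vector, proposal: Vector, weight: float) -> Vector:
    """Element-wise (1 - weight) * current + weight * proposal."""
    return [(1.0 - weight) * c + weight * p for c, p in zip(current, proposal)]


def progressive_refine(prior_latent: Vector, refiner: Refiner, num_steps: int = 4) -> Vector:
    """Refine a latent prior in a few blended steps (assumed schedule)."""
    latent = list(prior_latent)
    for step in range(1, num_steps + 1):
        weight = step / num_steps  # assumption: trust the refiner more at each step
        proposal = refiner(latent, prior_latent)  # stand-in for the lightweight U-Net
        latent = linear_blend(latent, proposal, weight)
    return latent


def toy_encode(depth: Vector) -> Vector:
    """Stand-in for a VAE encoder: average adjacent pairs (2x compression)."""
    return [(depth[i] + depth[i + 1]) / 2.0 for i in range(0, len(depth) - 1, 2)]


def toy_decode(latent: Vector) -> Vector:
    """Stand-in for a VAE decoder: repeat each latent value twice."""
    return [value for value in latent for _ in range(2)]


def make_toy_refiner(target: Vector) -> Refiner:
    """Stand-in refiner that moves each latent value halfway toward a fixed target.

    A real refiner would condition on the prior; this toy one ignores it.
    """
    def refiner(latent: Vector, prior: Vector) -> Vector:
        return [z + 0.5 * (t - z) for z, t in zip(latent, target)]
    return refiner


def estimate_depth(coarse_depth: Vector, refiner: Refiner, num_steps: int = 4) -> Vector:
    """Coarse prior -> latent space -> few-step refinement -> decoded depth."""
    latent = toy_encode(coarse_depth)
    latent = progressive_refine(latent, refiner, num_steps)
    return toy_decode(latent)


if __name__ == "__main__":
    coarse = [0.0, 0.0, 2.0, 2.0]  # pretend output of the ViT-based module
    refiner = make_toy_refiner(target=[1.0, 3.0])
    print(estimate_depth(coarse, refiner))
```

The point of the sketch is that refinement starts from the coarse prior rather than from noise, which is why a small number of steps can suffice. The cost of each step depends on the latent size, not on the full image resolution.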
