2604.10591v1 Apr 12, 2026 cs.CV

GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

Md Aminur Hossain (Citations: 10, h-index: 2)

Ayush V. Patel (Citations: 12, h-index: 3)

B. Banerjee (Citations: 10, h-index: 2)

M. Khan (Citations: 21, h-index: 3)

Maram Hasan (Citations: 17, h-index: 2)

Savitra Roy (Citations: 1, h-index: 1)

S. Bhowmik (Citations: 7, h-index: 1)

Mainak Singha, University of Trento (Citations: 299, h-index: 7)

Subhasis Chaudhuri (Citations: 109, h-index: 3)

Original Abstract

Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.
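The abstract names three pretraining terms (masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment) but does not say how they are combined. The following is a minimal sketch, assuming a simple weighted sum of a masked-patch MSE, a latent-space MSE, and a symmetric InfoNCE loss. The function names, the weights `w_mae`, `w_jepa`, `w_clip`, and the temperature of 0.07 are all hypothetical and are not taken from the paper.

```python
import math


def masked_mse(pred, target, mask):
    """Reconstruction error averaged over masked positions only (mask == 1)."""
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    return num / den if den else 0.0


def jepa_loss(pred_emb, target_emb):
    """MSE between predicted and target embeddings.

    In a real JEPA setup the target would come from a separate target
    encoder with gradients stopped; here both are plain lists.
    """
    return sum((p - t) ** 2 for p, t in zip(pred_emb, target_emb)) / len(pred_emb)


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _row_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i and caption i form the positive pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_row_cross_entropy(logits) + _row_cross_entropy(transposed))


def joint_loss(mae, jepa, clip, w_mae=1.0, w_jepa=1.0, w_clip=1.0):
    """Hypothetical weighted sum of the three pretraining terms."""
    return w_mae * mae + w_jepa * jepa + w_clip * clip
```

In this sketch, matched image and caption embeddings drive `info_nce` toward zero, while swapped pairs are penalized heavily. The relative weights control how strongly the representation is pulled toward reconstruction fidelity versus caption alignment.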

Citations: 1

Influential: 0

Altmetric: 3.5

Score: 18.5
