2606.17020v1 Jun 15, 2026 cs.CV

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Jiujiang Guo
Jiujiang Guo
Citations: 73
h-index: 5
Yiwei Wei
Yiwei Wei
Citations: 360
h-index: 12
Chengyin Hu
Chengyin Hu
Citations: 6
h-index: 1
Yuxiang Dong
Yuxiang Dong
Citations: 50
h-index: 2
Xu Sun
Xu Sun
Citations: 99
h-index: 3
Jiajun Han
Jiajun Han
Citations: 0
h-index: 0
Qike Zhang
Qike Zhang
Citations: 22
h-index: 3
Benqi Zhang
Benqi Zhang
Citations: 1
h-index: 1
Fengyu Zhang
Fengyu Zhang
Citations: 2
h-index: 1

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

0 Citations
0 Influential
6 Altmetric
30.0 Score
Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

Log in to request an AI analysis.

댓글

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!