2602.19190v1 Feb 22, 2026 cs.CV

FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Xiaokun Zhang, Baiyun, Qingchen Fang, Ruyi Zhang, Haipeng Wang, Yi Yang, Ziqi Ye, Xiaorong Guo, Xinpeng Zhou

Original Abstract

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.
