Abstract: Retrieving relevant imagery from vast satellite archives is crucial for
applications like disaster response and long-term climate monitoring. However,
most text-to-image retrieval systems are limited to RGB data, failing to
exploit the unique physical information captured by other sensors, such as the
all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the
spectral signatures in optical multispectral data. To bridge this gap, we
introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1
SAR and Sentinel-2 multispectral images paired with structured textual
annotations for land cover, land use, and crisis events harmonized from
authoritative land cover systems (CORINE and Dynamic World) and crisis-specific
sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining),
a novel framework that uses text as a bridge to align unpaired optical and SAR
images into a unified embedding space. Our experiments show that CLOSP achieves
a new state-of-the-art, improving retrieval nDCG@1000 by 54% over existing
models. Additionally, we find that the unified training strategy overcomes the
inherent difficulty of interpreting SAR imagery by transferring rich semantic
knowledge from the optical domain through indirect, text-mediated interaction. Furthermore,
GeoCLOSP, which integrates geographic coordinates into our framework, introduces a
trade-off between generality and specificity: while CLOSP excels
at general semantic tasks, GeoCLOSP becomes a specialized expert for
retrieving location-dependent crisis events and rare geographic features. This
work highlights that the integration of diverse sensor data and geographic
context is essential for unlocking the full potential of remote sensing
archives.
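
To make the "text as a bridge" idea concrete, the sketch below shows one plausible way such an alignment could be set up: a CLIP-style symmetric contrastive loss applied separately to optical-text and SAR-text pairs, so the two sensors share an embedding space through the common text encoder without ever being paired directly. This is a minimal illustration under stated assumptions, not the authors' implementation; the encoder interfaces, batch format, and temperature value are hypothetical.

```python
# Minimal sketch (assumptions, not the CLOSP reference code): two image
# encoders (optical, SAR) are each contrasted against a shared text encoder,
# which indirectly aligns the two sensor modalities in one embedding space.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between a batch of image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Average of image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def closp_step(optical_batch: dict, sar_batch: dict,
               optical_enc, sar_enc, text_enc) -> torch.Tensor:
    """One hypothetical training step: each sensor is paired only with its own
    text annotations, never with the other sensor, so the shared text space
    acts as the bridge between optical and SAR embeddings."""
    loss_opt = contrastive_loss(optical_enc(optical_batch["images"]),
                                text_enc(optical_batch["captions"]))
    loss_sar = contrastive_loss(sar_enc(sar_batch["images"]),
                                text_enc(sar_batch["captions"]))
    return loss_opt + loss_sar
```

In this reading, GeoCLOSP would add a geographic-coordinate signal (e.g. an extra location encoder or conditioning input) on top of the same contrastive setup; the exact mechanism is not specified in the abstract.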