📄 Abstract
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data
for Earth observation but poses challenges for existing multimodal foundation
models due to two key bottlenecks: (1) limited availability of UHR training
data, and (2) token explosion caused by the large image size. To address data
scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA
(avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in
RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion,
our pilot studies reveal significant redundancy in RS images: crucial
information is concentrated in a small subset of object-centric tokens, while
pruning background tokens (e.g., ocean or forest) can even improve performance.
Motivated by these findings, we propose two strategies, Background Token
Pruning and Anchored Token Selection, to reduce the memory footprint while
preserving key semantics. Integrating these techniques, we introduce
GeoLLaVA-8K, the first RS-focused multimodal large language model capable of
handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework.
Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art
on the XLRS-Bench.
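To see why token reduction matters at this scale: an 8,376$\times$8,376 image encoded with a standard ViT-style encoder using 14-pixel patches would produce roughly 598$\times$598 ≈ 358K visual tokens, far beyond typical LLM context windows. The snippet below is a minimal, hypothetical PyTorch sketch of the general idea (score patch tokens, drop low-information background tokens, keep a small set of anchor tokens to preserve coarse layout). The scoring heuristic, function name `prune_background_tokens`, and parameters such as `keep_ratio` are illustrative assumptions, not the authors' actual Background Token Pruning or Anchored Token Selection implementation.

```python
# Illustrative sketch only -- NOT the GeoLLaVA-8K implementation.
# Assumes patch tokens from a ViT-style encoder with shape (batch, num_tokens, dim).
import torch

def prune_background_tokens(tokens: torch.Tensor,
                            keep_ratio: float = 0.25,
                            num_anchors: int = 64) -> torch.Tensor:
    """Keep high-information (object-centric) tokens plus uniformly spaced anchors.

    Scoring heuristic (assumption): background patches (e.g., ocean, forest) tend
    to sit close to the mean patch embedding, so distance from the mean is used
    as a crude informativeness score.
    """
    b, n, d = tokens.shape
    mean_tok = tokens.mean(dim=1, keepdim=True)            # (b, 1, d)
    scores = (tokens - mean_tok).norm(dim=-1)               # (b, n)

    k = max(1, int(n * keep_ratio))
    topk_idx = scores.topk(k, dim=1).indices                 # object-centric tokens

    # Anchor tokens: a fixed uniform subsample that keeps coarse spatial coverage.
    anchor_idx = torch.linspace(0, n - 1, num_anchors).long()
    anchor_idx = anchor_idx.unsqueeze(0).expand(b, -1).to(tokens.device)

    keep_idx = torch.cat([topk_idx, anchor_idx], dim=1)
    keep_idx = torch.sort(keep_idx, dim=1).values             # restore original order
    # Duplicates between the top-k and anchor sets are left in for simplicity.

    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))


if __name__ == "__main__":
    # Toy scale for the demo; a real 8,376x8,376 image at patch size 14 would
    # yield ~598*598 ≈ 358K tokens per image.
    x = torch.randn(2, 1024, 256)
    pruned = prune_background_tokens(x, keep_ratio=0.25, num_anchors=64)
    print(x.shape, "->", pruned.shape)
```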
Key Contributions
Introduces GeoLLaVA-8K, scaling multimodal LLMs to 8K resolution for remote sensing by addressing data scarcity with new VQA datasets (SuperRS-VQA, HighRS-VQA) and mitigating token explosion via Background Token Pruning and Anchored Token Selection. This enables analysis of previously intractable high-resolution imagery.
Business Value
Enables more detailed and accurate analysis of Earth's surface from high-resolution imagery, supporting applications in climate change monitoring, disaster response, precision agriculture, and urban development. It unlocks insights from vast amounts of previously unusable data.