Source: arxiv_cv (research paper). Audience: Remote Sensing Scientists, Geospatial Analysts, AI Researchers, Environmental Scientists, Urban Planners.

GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Abstract

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376 × 8,376) and HighRS-VQA (avg. 2,000 × 1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K × 8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state of the art on XLRS-Bench.
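To give a rough sense of the "token explosion" the abstract mentions, the sketch below counts non-overlapping patch tokens at the two average dataset resolutions. The patch size of 14 pixels is an assumption chosen for illustration (the abstract does not state the vision encoder's patch size), and `patch_token_count` is a hypothetical helper, not something taken from the paper.

```python
def patch_token_count(height: int, width: int, patch_size: int = 14) -> int:
    """Count non-overlapping square patches that fit in an image.

    patch_size=14 is an assumed value for illustration only; partial
    patches at the right and bottom edges are simply dropped here.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    return (height // patch_size) * (width // patch_size)


# Average resolutions quoted in the abstract.
superrs_tokens = patch_token_count(8376, 8376)  # SuperRS-VQA average
highrs_tokens = patch_token_count(2000, 1912)   # HighRS-VQA average

print(f"SuperRS-VQA: {superrs_tokens} tokens")
print(f"HighRS-VQA:  {highrs_tokens} tokens")
```

Under this assumption a single SuperRS-VQA image yields several hundred thousand visual tokens before any pruning, which is the scale of input that motivates reducing the token count.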

Key Contributions

Introduces GeoLLaVA-8K, which scales multimodal LLMs to 8K resolution for remote sensing. It addresses data scarcity with two new VQA datasets (SuperRS-VQA and HighRS-VQA) and mitigates token explosion via Background Token Pruning and Anchored Token Selection, allowing the model to handle inputs up to 8K × 8K.
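The general idea behind dropping background tokens can be sketched as a generic top-k selection over per-token relevance scores. This is an illustration under stated assumptions, not the paper's Background Token Pruning or Anchored Token Selection: the summary does not say how background tokens are identified, so the `scores` input, the `keep_ratio` parameter, and the function name are all hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_background_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float = 0.25,
) -> Tuple[List[str], List[int]]:
    """Keep the highest-scoring fraction of tokens, preserving original order.

    `scores` is a hypothetical per-token relevance score (higher means more
    object-centric); how such scores are obtained is outside this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by score, highest first, then restore spatial order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


# Toy example: strings stand in for patch embeddings.
patches = ["ocean", "ship", "ocean", "harbor", "forest", "ocean", "crane", "ocean"]
relevance = [0.05, 0.9, 0.04, 0.7, 0.1, 0.03, 0.8, 0.02]
kept_patches, kept_indices = prune_background_tokens(patches, relevance, keep_ratio=0.375)
print(kept_patches, kept_indices)
```

Keeping the surviving indices alongside the tokens lets a downstream model retain positional information for the patches that remain.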

Business Value

Supports more detailed analysis of Earth's surface from ultra-high-resolution imagery, with potential applications in climate change monitoring, disaster response, precision agriculture, and urban development. It could make imagery that was too large for earlier multimodal models usable for analysis.