📄 Abstract
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data
for Earth observation but poses challenges for existing multimodal foundation
models due to two key bottlenecks: (1) limited availability of UHR training
data, and (2) token explosion caused by the large image size. To address data
scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA
(avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in
RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion,
our pilot studies reveal significant redundancy in RS images: crucial
information is concentrated in a small subset of object-centric tokens, while
pruning background tokens (e.g., ocean or forest) can even improve performance.
Motivated by these findings, we propose two strategies, Background Token
Pruning and Anchored Token Selection, to reduce the memory footprint while
preserving key semantics. Integrating these techniques, we introduce
GeoLLaVA-8K, the first RS-focused multimodal large language model capable of
handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework.
Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art
on the XLRS-Bench.
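To see why token reduction matters at this scale: an 8,376$\times$8,376 image encoded with a standard ViT-style encoder using 14-pixel patches would produce roughly 598$\times$598 ≈ 358K visual tokens, far beyond typical LLM context windows. The snippet below is a minimal, hypothetical PyTorch sketch of the general idea (score patch tokens, drop low-information background tokens, keep a small set of anchor tokens to preserve coarse layout). The scoring heuristic, function name `prune_background_tokens`, and parameters such as `keep_ratio` are illustrative assumptions, not the authors' actual Background Token Pruning or Anchored Token Selection implementation.

```python
# Illustrative sketch only -- NOT the GeoLLaVA-8K implementation.
# Assumes patch tokens from a ViT-style encoder with shape (batch, num_tokens, dim).
import torch

def prune_background_tokens(tokens: torch.Tensor,
                            keep_ratio: float = 0.25,
                            num_anchors: int = 64) -> torch.Tensor:
    """Keep high-information (object-centric) tokens plus uniformly spaced anchors.

    Scoring heuristic (assumption): background patches (e.g., ocean, forest) tend
    to sit close to the mean patch embedding, so distance from the mean is used
    as a crude informativeness score.
    """
    b, n, d = tokens.shape
    mean_tok = tokens.mean(dim=1, keepdim=True)            # (b, 1, d)
    scores = (tokens - mean_tok).norm(dim=-1)               # (b, n)

    k = max(1, int(n * keep_ratio))
    topk_idx = scores.topk(k, dim=1).indices                 # object-centric tokens

    # Anchor tokens: a fixed uniform subsample that keeps coarse spatial coverage.
    anchor_idx = torch.linspace(0, n - 1, num_anchors).long()
    anchor_idx = anchor_idx.unsqueeze(0).expand(b, -1).to(tokens.device)

    keep_idx = torch.cat([topk_idx, anchor_idx], dim=1)
    keep_idx = torch.sort(keep_idx, dim=1).values             # restore original order
    # Duplicates between the top-k and anchor sets are left in for simplicity.

    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))


if __name__ == "__main__":
    # Toy scale for the demo; a real 8,376x8,376 image at patch size 14 would
    # yield ~598*598 ≈ 358K tokens per image.
    x = torch.randn(2, 1024, 256)
    pruned = prune_background_tokens(x, keep_ratio=0.25, num_anchors=64)
    print(x.shape, "->", pruned.shape)
```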
Key Contributions
Introduces GeoLLaVA-8K, scaling multimodal LLMs to 8K resolution for remote sensing by addressing data scarcity with new VQA datasets (SuperRS-VQA, HighRS-VQA) and mitigating token explosion via Background Token Pruning and Anchored Token Selection. This enables analysis of previously intractable high-resolution imagery.
Business Value
Enables more detailed and accurate analysis of Earth's surface from high-resolution imagery, supporting applications in climate change monitoring, disaster response, precision agriculture, and urban development. It unlocks insights from vast amounts of previously unusable data.