📄 Abstract
Recent advances in self-supervised learning for Vision Transformers (ViTs)
have fueled breakthroughs in remote sensing (RS) foundation models. However,
the quadratic complexity of self-attention poses a significant barrier to
scalability, particularly for large models and high-resolution images. While
the linear-complexity Mamba architecture offers a promising alternative,
existing RS applications of Mamba remain limited to supervised tasks on small,
domain-specific datasets. To address these challenges, we propose RoMA, a
framework that enables scalable self-supervised pretraining of Mamba-based RS
foundation models using large-scale, diverse, unlabeled data. RoMA enhances
scalability for high-resolution images through a tailored auto-regressive
learning strategy, incorporating two key innovations: 1) a rotation-aware
pretraining mechanism combining adaptive cropping with angular embeddings to
handle sparsely distributed objects with arbitrary orientations, and 2)
multi-scale token prediction objectives that address the extreme variations in
object scales inherent to RS imagery. Systematic empirical studies validate
that Mamba adheres to RS data and parameter scaling laws, with performance
scaling reliably as model and data size increase. Furthermore, experiments
across scene classification, object detection, and semantic segmentation tasks
demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based
counterparts in both accuracy and computational efficiency. The source code and
pretrained models will be released at https://github.com/MiliLab/RoMA.
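To make the rotation-aware pretraining idea more concrete, the sketch below shows one way an angular embedding could be attached to the patch tokens of an adaptively rotated crop. The module name, sinusoidal angle encoding, and dimensions are illustrative assumptions, not the paper's implementation; the released code at the repository above is authoritative.

```python
# Illustrative sketch only: injecting a crop's rotation angle into patch tokens
# as an "angular embedding". AngularEmbedding, the sin/cos encoding, and d_model
# are hypothetical choices made for this example.
import math
import torch
import torch.nn as nn


class AngularEmbedding(nn.Module):
    """Map a crop's rotation angle (radians) to a d_model-dim embedding."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2, d_model)

    def forward(self, angle: torch.Tensor) -> torch.Tensor:
        # angle: (batch,) rotation applied to each adaptively cropped view
        feats = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)
        return self.proj(feats)  # (batch, d_model)


# Usage: add the angle embedding to every patch token of the rotated crop.
tokens = torch.randn(4, 196, 768)            # (batch, num_patches, d_model)
angles = torch.rand(4) * 2 * math.pi         # one random rotation per crop
tokens = tokens + AngularEmbedding(768)(angles).unsqueeze(1)
```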
Key Contributions
RoMA is a framework for scalable self-supervised pretraining of Mamba-based foundation models in remote sensing, sidestepping the quadratic complexity of ViT self-attention. Within a tailored auto-regressive learning strategy, it combines a rotation-aware pretraining mechanism (adaptive cropping with angular embeddings) and multi-scale token prediction objectives, enabling the model to handle high-resolution imagery, sparsely distributed objects with arbitrary orientations, and extreme variations in object scale.
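As a rough illustration of what a multi-scale auto-regressive token-prediction objective could look like, the sketch below regresses the next patch token in raster order at several pooled spatial scales. The pooling scales, raster ordering, and MSE loss are assumptions for illustration, not the exact recipe described in the paper.

```python
# Hedged sketch of a multi-scale auto-regressive objective: predictions at
# position t are asked to match the *next* target token at several spatial
# scales, obtained here by average-pooling the patch grid. All specifics
# (grid size, scales, MSE) are illustrative assumptions.
import torch
import torch.nn.functional as F


def multiscale_ar_loss(pred: torch.Tensor, target: torch.Tensor,
                       grid: int = 14, scales=(1, 2, 4)) -> torch.Tensor:
    """pred, target: (batch, grid*grid, dim) token sequences in raster order."""
    b, n, d = target.shape
    loss = 0.0
    for s in scales:
        # Pool both prediction and target token grids to a coarser scale.
        t2d = target.transpose(1, 2).reshape(b, d, grid, grid)
        p2d = pred.transpose(1, 2).reshape(b, d, grid, grid)
        pooled_t = F.avg_pool2d(t2d, kernel_size=s).flatten(2).transpose(1, 2)
        pooled_p = F.avg_pool2d(p2d, kernel_size=s).flatten(2).transpose(1, 2)
        # Next-token prediction: token t's output regresses token t+1's target.
        loss = loss + F.mse_loss(pooled_p[:, :-1], pooled_t[:, 1:])
    return loss / len(scales)


# Usage with dummy encoder outputs and target token embeddings.
pred = torch.randn(2, 196, 768)
target = torch.randn(2, 196, 768)
print(multiscale_ar_loss(pred, target))
```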
Business Value
Enables more efficient and scalable analysis of large-scale remote sensing data, leading to improved insights for applications like environmental monitoring, urban planning, and disaster management.