arxiv_cv | Research Paper | Audience: Computer Vision Researchers, 3D Graphics Engineers, Robotics Engineers

MVSMamba: Multi-View Stereo with State Space Model

Abstract

Robust feature representations are essential for learning-based Multi-View Stereo (MVS), which relies on accurate feature matching. Recent MVS methods leverage Transformers to capture long-range dependencies based on local features extracted by conventional feature pyramid networks. However, the quadratic complexity of Transformer-based MVS methods makes it difficult to balance performance and efficiency. Motivated by the global modeling capability and linear complexity of the Mamba architecture, we propose MVSMamba, the first Mamba-based MVS network. MVSMamba enables efficient global feature aggregation with minimal computational overhead. To fully exploit Mamba's potential in MVS, we propose a Dynamic Mamba module (DM-module) based on a novel reference-centered dynamic scanning strategy, which enables: (1) efficient intra- and inter-view feature interaction from the reference to source views, (2) omnidirectional multi-view feature representations, and (3) multi-scale global feature aggregation. Extensive experimental results demonstrate that MVSMamba outperforms state-of-the-art MVS methods on the DTU dataset and the Tanks-and-Temples benchmark in both performance and efficiency. The source code is available at https://github.com/JianfeiJ/MVSMamba.
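The linear-complexity claim rests on the state space recurrence h_t = Ā_t h_{t-1} + B̄_t x_t, y_t = C_t h_t. This recurrence can be evaluated in one pass over the tokens. In Mamba the step size Δ_t depends on the input, and so do Ā_t, B̄_t and C_t (the "selective" part). The sketch below is a minimal single-channel version of such a scan. It assumes a diagonal A, zero-order-hold discretization of A, the simplified Δ·B discretization of B, and hypothetical scalar projections w_delta, w_b and w_c. It illustrates the technique only and is not MVSMamba's implementation, which is in the linked repository.

```python
import math
from typing import List


def softplus(z: float) -> float:
    # Numerically stable log(1 + e^z); keeps the step size positive.
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def selective_scan(xs: List[float], a_diag: List[float], w_delta: float,
                   w_b: List[float], w_c: List[float]) -> List[float]:
    """Single-channel selective SSM scan (illustrative sketch).

    xs:       scalar inputs x_1..x_L
    a_diag:   negative reals, the diagonal of A (state size N)
    w_delta:  hypothetical weight making the step size input-dependent
    w_b, w_c: hypothetical weights; B_t = w_b * x_t and C_t = w_c * x_t
    """
    n = len(a_diag)
    h = [0.0] * n              # hidden state carried from token to token
    ys = []
    for x in xs:               # one pass over the sequence: O(L * N) work
        delta = softplus(w_delta * x)
        y = 0.0
        for i in range(n):
            a_bar = math.exp(delta * a_diag[i])   # zero-order hold for A
            b_bar = delta * (w_b[i] * x)          # simplified discretization of B_t
            h[i] = a_bar * h[i] + b_bar * x
            y += (w_c[i] * x) * h[i]              # y_t = C_t h_t
        ys.append(y)
    return ys
```

Each token updates the state once, so L tokens cost O(L·N) for a state of size N. Because y_t depends only on x_1..x_t, changing a later input leaves earlier outputs unchanged.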

Authors (7)

Jianfei Jiang, Qiankun Liu, Hongyuan Liu, Haochen Yu, Liyong Wang, Jiansheng Chen, and one additional author not listed here

Submitted

November 3, 2025

arXiv Category

cs.CV

Key Contributions

MVSMamba is presented as the first Mamba-based MVS network. It aggregates global features at a cost linear in the number of tokens, avoiding the quadratic complexity of Transformer-based methods. It also introduces a Dynamic Mamba module (DM-module) built on a reference-centered dynamic scanning strategy. This strategy supports intra- and inter-view feature interaction, omnidirectional multi-view feature representations, and multi-scale global feature aggregation.
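The abstract does not spell out how the reference-centered dynamic scanning strategy orders tokens. One plausible reading is sketched below purely as an assumption. Each sequence fed to the state space model visits the reference view's tokens before a source view's tokens, so the recurrent state carries reference information into that source view. Scanning in several spatial directions then supplies omnidirectional context. The function names and the four direction labels are hypothetical. The actual DM-module, including its dynamic behavior and multi-scale handling, is defined in the paper and repository.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, int, int]  # (view index, row, col); view 0 is the reference


def spatial_order(h: int, w: int, direction: str) -> List[Tuple[int, int]]:
    # Hypothetical direction set: row-major, column-major, and their reverses.
    rows_first = [(r, c) for r in range(h) for c in range(w)]
    cols_first = [(r, c) for c in range(w) for r in range(h)]
    orders = {
        "row": rows_first,
        "row_rev": rows_first[::-1],
        "col": cols_first,
        "col_rev": cols_first[::-1],
    }
    return orders[direction]


def reference_centered_sequences(
    num_views: int, h: int, w: int,
    directions: Tuple[str, ...] = ("row", "row_rev", "col", "col_rev"),
) -> Dict[Tuple[int, str], List[Token]]:
    """For every source view and direction, list reference tokens first,
    then that source view's tokens in the same spatial order (assumed scheme)."""
    seqs: Dict[Tuple[int, str], List[Token]] = {}
    for src in range(1, num_views):
        for d in directions:
            order = spatial_order(h, w, d)
            ref_part = [(0, r, c) for r, c in order]
            src_part = [(src, r, c) for r, c in order]
            seqs[(src, d)] = ref_part + src_part
    return seqs
```

Consider 3 views of a 2 × 3 feature map. The sketch yields 2 source views × 4 directions = 8 sequences of 12 tokens each, and every sequence starts with the 6 reference tokens.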

Business Value

Enables more efficient and accurate 3D reconstruction for applications like autonomous driving, robotics, and virtual/augmented reality, potentially reducing computational costs and improving real-time performance.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

High: the cost of a Mamba layer grows linearly with sequence length, whereas full self-attention grows quadratically. This makes Mamba better suited than Transformers to latency-sensitive applications.
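The sketch below makes the quadratic-versus-linear comparison concrete. It counts the dominant interaction terms for one hypothetical configuration: 640 × 512 input images, features at 1/4 resolution, 5 views, and a state size of 16. These numbers are illustrative assumptions, not the paper's settings. The counts ignore the feature dimension and constant factors. They also assume full global attention; methods that restrict or approximate attention pay less.

```python
def token_count(height: int, width: int, downsample: int, num_views: int) -> int:
    # Tokens after flattening each view's downsampled feature map and stacking views.
    return (height // downsample) * (width // downsample) * num_views


def attention_scores(n_tokens: int) -> int:
    # Full global self-attention scores every token against every token: N^2.
    return n_tokens * n_tokens


def scan_state_updates(n_tokens: int, state_size: int) -> int:
    # A state space scan updates a state of size N_state once per token: N * N_state.
    return n_tokens * state_size


n = token_count(640, 512, 4, 5)
print(n, attention_scores(n), scan_state_updates(n, 16))
# prints: 102400 10485760000 1638400
```

Under these assumptions the attention term is 6,400 times larger, because N / N_state = 102,400 / 16.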

Limitations Addressed

Quadratic complexity of Transformer-based MVS methods; the difficulty of balancing performance and efficiency; limited long-range dependency capture in conventional feature pyramid networks (FPNs).

Technical Tags

Multi-View Stereo, State Space Model, Mamba, Feature Aggregation, Dynamic Scanning, Reference-Centered, Omnidirectional Features, Multi-Scale Features, Transformer, Feature Pyramid Networks

Research Topics

3D Reconstruction, Computer Vision, Deep Learning Architectures, Geometric Modeling, Feature Representation

Methods & Architectures

Mamba Architecture, Dynamic Mamba Module, Reference-Centered Dynamic Scanning, Feature Pyramid Networks, Mamba, Transformer

Applications & Tasks

3D Vision, Computer Graphics, Robotics, Augmented Reality, 3D Reconstruction, Feature Matching, Geometric Accuracy, Multi-View Stereo (MVS), Depth Estimation, 3D Scene Reconstruction

Related Fields

Computer Vision, Deep Learning, 3D Geometry, Robotics

Keywords

Multi-View Stereo, Mamba, State Space Model, 3D Reconstruction, Deep Learning, Feature Extraction, Computer Vision, Efficiency, Global Aggregation, Dynamic Module, Reference-Centered, Omnidirectional

Commercial Potential

Potential Products

3D scanning software, Real-time 3D reconstruction systems, AR/VR content creation tools

Target Industries

Automotive, Robotics, Gaming, Film and Entertainment, Architecture

Use Case Examples

Generating 3D models from multiple camera views, Autonomous vehicle perception systems, Virtual reality environment reconstruction

Competitive Edge

Offers a more efficient alternative to Transformer-based MVS methods by leveraging the linear complexity of Mamba, potentially achieving comparable or better performance with reduced computational cost.

Market Opportunity

Growing market for 3D sensing and reconstruction technologies.

Revenue Models

Licensing of the technology; integration into existing software platforms.

Resource Requirements

Compute Needs

Likely moderate to high, depending on the scale of the MVS problem (image resolution and number of views), but expected to be lower than for comparable Transformer-based methods.

Data Requirements

Requires multi-view image datasets with known camera poses.
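Known camera poses and calibration are what let an MVS network relate pixels across views. The sketch below shows the generic pinhole reprojection used in learning-based MVS. In practice this step is often implemented as a differentiable homography warp over many depth hypotheses. A reference pixel is back-projected at a hypothesized depth and then projected into a source view. The sketch assumes zero-skew intrinsics given as (fx, fy, cx, cy) and world-to-camera poses x_cam = R x_world + t. It is not code from MVSMamba.

```python
from typing import List, Tuple

Mat3 = List[List[float]]
Vec3 = List[float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), zero skew assumed


def matvec(m: Mat3, v: Vec3) -> Vec3:
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m: Mat3) -> Mat3:
    return [[m[j][i] for j in range(3)] for i in range(3)]


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    # Pixel + depth -> 3D point in the camera frame.
    fx, fy, cx, cy = k
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def project(p: Vec3, k: Intrinsics) -> Tuple[float, float]:
    # 3D point in the camera frame -> pixel (perspective division by z).
    fx, fy, cx, cy = k
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def reproject(u: float, v: float, depth: float, k_ref: Intrinsics, k_src: Intrinsics,
              r_ref: Mat3, t_ref: Vec3, r_src: Mat3, t_src: Vec3) -> Tuple[float, float]:
    """Map reference pixel (u, v) at a depth hypothesis into the source image.
    Poses are world-to-camera: x_cam = R @ x_world + t."""
    p_ref = backproject(u, v, depth, k_ref)
    # Reference camera -> world: x_world = R_ref^T (x_cam - t_ref).
    p_world = matvec(transpose(r_ref), [p_ref[i] - t_ref[i] for i in range(3)])
    # World -> source camera.
    p_src = [a + b for a, b in zip(matvec(r_src, p_world), t_src)]
    return project(p_src, k_src)
```

For a fixed reference pixel, sweeping the depth hypothesis traces the epipolar line in the source image, and plane-sweep cost volumes sample source features along that line. For the same reason, calibration errors matter: an error in the intrinsics or in [R | t] shifts every projected location.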

Deployment Constraints

Requires accurate camera calibration for multi-view input.

Scalability

The linear complexity of Mamba in sequence length suggests good scalability with input size (image resolution and number of views).
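The scaling claim can be written out under simple assumptions. Take V views of H × W images, features downsampled by a factor r, full global attention, and a scan with state size N_state. These symbols are introduced here for illustration only, and per-layer constants and the feature dimension are ignored.

```latex
N = \frac{V\,H\,W}{r^{2}}, \qquad
C_{\text{attn}} \propto N^{2}, \qquad
C_{\text{scan}} \propto N \, N_{\text{state}}

(H, W) \to (sH, sW) \;\Rightarrow\; N \to s^{2} N, \quad
C_{\text{attn}} \to s^{4} \, C_{\text{attn}}, \quad
C_{\text{scan}} \to s^{2} \, C_{\text{scan}}
```

Doubling the resolution (s = 2) therefore multiplies the attention term by 16 but the scan term by only 4.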

Production Readiness

Maturity Level

Research

Time to Market

1-3 years