Abstract: Multi-modal semantic segmentation significantly enhances AI agents'
perception and scene understanding, especially under adverse conditions like
low-light or overexposed environments. Leveraging additional modalities
(X-modality) like thermal and depth alongside traditional RGB provides
complementary information, enabling more robust and reliable prediction. In
this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic
segmentation built on the Mamba state space architecture. Unlike conventional methods that
rely on CNNs, with their limited local receptive fields, or Vision Transformers
(ViTs), which offer global receptive fields at the cost of quadratic
complexity, our model achieves global receptive fields with linear complexity.
By employing a Siamese encoder and introducing a Mamba-based fusion mechanism,
we effectively select essential information from the different modalities. A
decoder is then developed to enhance the channel-wise modeling ability of the
model. Our proposed method is rigorously evaluated on both RGB-Thermal and
RGB-Depth semantic segmentation tasks, demonstrating its superiority and
marking the first successful application of State Space Models (SSMs) in
multi-modal perception tasks. Code is available at
https://github.com/zifuwan/Sigma.
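
To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the overall arrangement: a weight-shared (Siamese) encoder applied to both the RGB and X-modality inputs, a fusion step combining the two feature streams, and a decoder producing per-pixel logits. The Mamba/SSM blocks are replaced by plain convolutions as placeholders, and all names and sizes (SigmaSketch, PlaceholderBlock, width, num_classes) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlaceholderBlock(nn.Module):
    """Stand-in for a Mamba (state space) block.

    A real implementation would use a selective-scan SSM layer here;
    this sketch uses a strided convolution purely as a placeholder.
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class SigmaSketch(nn.Module):
    """Hypothetical Siamese-encoder + fusion + decoder layout."""

    def __init__(self, num_classes=9, width=64):
        super().__init__()
        # Siamese encoder: the same weights process both RGB and X-modality.
        self.encoder = nn.Sequential(
            PlaceholderBlock(3, width),
            PlaceholderBlock(width, 2 * width),
        )
        # Fusion: combines complementary information from the two streams
        # (a simple 1x1 convolution here, in place of the Mamba-based fusion).
        self.fuse = nn.Conv2d(4 * width, 2 * width, kernel_size=1)
        # Decoder: maps fused features to per-class logits.
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * width, width, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, num_classes, kernel_size=1),
        )

    def forward(self, rgb, x_mod):
        f_rgb = self.encoder(rgb)    # features from the RGB image
        f_x = self.encoder(x_mod)    # features from the X-modality (shared weights)
        fused = self.fuse(torch.cat([f_rgb, f_x], dim=1))
        logits = self.decoder(fused)
        # Upsample back to the input resolution for dense prediction.
        return F.interpolate(logits, size=rgb.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    rgb = torch.randn(1, 3, 480, 640)
    thermal = torch.randn(1, 3, 480, 640)  # thermal/depth often replicated to 3 channels
    out = SigmaSketch()(rgb, thermal)
    print(out.shape)  # torch.Size([1, 9, 480, 640])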