📄 Abstract
The rapid development of deep learning has significantly improved salient
object detection (SOD) combining both RGB and thermal (RGB-T) images. However,
existing Transformer-based RGB-T SOD models with quadratic complexity are
memory-intensive, limiting their application in high-resolution bimodal feature
fusion. To overcome this limitation, we propose a purely Fourier
Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for
accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier
Transform with linear complexity to design three key components: (1) To fuse
RGB and thermal modalities, we propose Modal-coordinated Perception Attention,
which aligns and enhances bimodal Fourier representation in multiple
dimensions; (2) To clarify object edges and suppress noise, we design
Frequency-decomposed Edge-aware Block, which deeply decomposes and filters
Fourier components of low-level features; (3) To accurately decode features, we
propose Fourier Residual Channel Attention Block, which prioritizes
high-frequency information while aligning channel-wise global relationships.
Additionally, even when converged, the predictions of existing deep
learning-based SOD models still exhibit frequency gaps relative to the ground truth. To address
this problem, we propose Co-focus Frequency Loss, which dynamically weights
hard frequencies during edge frequency reconstruction by cross-referencing
bimodal edge information in the Fourier domain. Extensive experiments on ten
bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine
existing state-of-the-art bimodal SOD models. Comprehensive ablation studies
further validate the value and effectiveness of our newly proposed components.
The code is available at https://github.com/JoshuaLPF/FreqSal.
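To make the idea of weighting "hard" frequencies concrete, here is a minimal NumPy sketch of a generic frequency-weighted reconstruction loss. It is not the paper's Co-focus Frequency Loss: it ignores the bimodal edge cross-referencing entirely, and the function name `frequency_weighted_loss` and the parameter `alpha` are hypothetical choices made for illustration. It only shows how a per-frequency error can be used as its own weight, so that poorly reconstructed frequencies dominate the loss.

```python
import numpy as np


def frequency_weighted_loss(pred, target, alpha=1.0):
    """Spectral squared error in which each frequency is weighted by its own error.

    Illustrative assumption only; not the loss defined in the paper.
    """
    # 2D FFT over the last two axes; norm="ortho" keeps the scale independent of image size.
    pred_f = np.fft.fft2(pred, norm="ortho")
    target_f = np.fft.fft2(target, norm="ortho")

    # Squared distance between the complex spectra at every frequency.
    dist = np.abs(pred_f - target_f) ** 2

    # Weight grows with the per-frequency error, so hard frequencies count more.
    weight = np.sqrt(dist) ** alpha
    max_w = weight.max()
    if max_w > 0:
        weight = weight / max_w  # normalise weights into [0, 1]

    return float(np.mean(weight * dist))
```

In a training setup the weights would normally be treated as constants (no gradient flows through them), and an automatic-differentiation framework would replace NumPy; both points are assumptions about typical usage, not details taken from the abstract.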
Key Contributions
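FreqSal is a purely Fourier Transform-based model for RGB-T salient object detection (SOD). It relies on the Fast Fourier Transform, which the authors describe as having linear complexity, to reduce memory usage relative to Transformer-based models with quadratic complexity. Its components include Modal-coordinated Perception Attention for fusing RGB and thermal features, the Frequency-decomposed Edge-aware Block for clarifying object edges and suppressing noise, the Fourier Residual Channel Attention Block for decoding, and the Co-focus Frequency Loss for narrowing frequency gaps between predictions and ground truth.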
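As a rough illustration of what decomposing and filtering Fourier components of a feature map can look like, the sketch below splits a single 2D array into low- and high-frequency parts with a radial mask in the Fourier domain. This is a generic signal-processing example, not the architecture of the Frequency-decomposed Edge-aware Block; the function name `split_frequencies` and the `radius` parameter are hypothetical.

```python
import numpy as np


def split_frequencies(feature, radius):
    """Split a 2D array into low- and high-frequency parts using a radial mask.

    Illustrative assumption only; not the block defined in the paper.
    """
    h, w = feature.shape

    # Centre the spectrum so the zero frequency sits at index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(feature))

    # Distance of every frequency bin from the centre of the shifted spectrum.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Keep the bins inside the radius for the low band, the rest for the high band.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high
```

Because the two masks partition the spectrum, `low + high` reproduces the input up to floating-point error. The high-frequency part carries edges and fine texture, which is why edge-oriented modules tend to operate on it.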
Business Value
Enables more efficient and accurate salient object detection using multi-modal data (RGB and thermal), beneficial for applications requiring robust object identification in various lighting and environmental conditions.