📄 Abstract
Automated monitoring of marine mammals in the St. Lawrence Estuary faces
extreme challenges: calls span low-frequency moans to ultrasonic clicks, often
overlap, and are embedded in variable anthropogenic and environmental noise. We
introduce a multi-step, attention-guided framework that first segments
spectrograms to generate soft masks of biologically relevant energy and then
fuses these masks with the raw inputs for multi-band, denoised classification.
Image and mask embeddings are integrated via mid-level fusion, enabling the
model to focus on salient spectrogram regions while preserving global context.
Using real-world recordings from the Saguenay St. Lawrence Marine Park Research
Station in Canada, we demonstrate that segmentation-driven attention and
mid-level fusion improve signal discrimination, reduce false positive
detections, and produce reliable representations for operational marine mammal
monitoring across diverse environmental conditions and signal-to-noise ratios.
Beyond in-distribution evaluation, we further assess the generalization of
Mask-Guided Classification (MGC) under distributional shifts by testing on
spectrograms generated with alternative acoustic transformations. While
high-capacity baseline models lose accuracy in this out-of-distribution (OOD)
setting, MGC maintains stable performance, with even simple fusion mechanisms
(gated, concatenation) achieving comparable results across distributions. This
robustness highlights the capacity of MGC to learn transferable representations
rather than overfitting to a specific transformation, thereby reinforcing its
suitability for large-scale, real-world biodiversity monitoring. We show that
in all experimental settings, the MGC framework consistently outperforms
baseline architectures, yielding substantial gains in accuracy on both
in-distribution and OOD data.
Authors (4)
Amine Razig
Youssef Soulaymani
Loubna Benabbou
Pierre Cauchy
Submitted
October 29, 2025
Key Contributions
Introduces a multi-step, attention-guided framework for underwater bioacoustic denoising and recognition. It employs spectrogram segmentation to generate masks, fuses these masks with raw inputs via mid-level fusion, and uses multi-band processing to improve signal discrimination, reduce false positives, and enable reliable marine mammal monitoring across diverse conditions.
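To make the mid-level fusion idea concrete, here is a minimal sketch of the two simple fusion mechanisms the abstract names, concatenation and gating, applied to an image embedding and a mask embedding. It is an illustration only, not the authors' implementation: the function names, the toy embedding values, and the per-dimension form of the gate are all assumptions, and a real model would learn the gate with a trainable layer over much larger embeddings.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(image_emb, mask_emb):
    """Concatenation fusion: join the two embeddings into one longer vector."""
    return list(image_emb) + list(mask_emb)


def gated_fusion(image_emb, mask_emb, gate_weights, gate_bias):
    """Gated fusion (hypothetical per-dimension form).

    For each dimension i:
        g_i     = sigmoid(w_img_i * image_i + w_mask_i * mask_i + b_i)
        fused_i = g_i * image_i + (1 - g_i) * mask_i
    so the gate decides how much each stream contributes.
    """
    fused = []
    for x, m, (w_img, w_mask), b in zip(image_emb, mask_emb, gate_weights, gate_bias):
        g = sigmoid(w_img * x + w_mask * m + b)
        fused.append(g * x + (1.0 - g) * m)
    return fused


# Toy embeddings (assumed values, standing in for encoder outputs).
image_emb = [0.2, -1.0, 0.5]
mask_emb = [0.9, 0.1, -0.3]

# Zero weights and biases give g = 0.5, so the gate reduces to a plain average.
gate_weights = [(0.0, 0.0)] * 3
gate_bias = [0.0] * 3

fused_concat = concat_fusion(image_emb, mask_emb)
fused_gated = gated_fusion(image_emb, mask_emb, gate_weights, gate_bias)
```

Concatenation keeps both streams intact and leaves the weighting to later layers, at the cost of doubling the embedding size. Gating keeps the original dimensionality and lets the model shift weight between the raw-spectrogram and mask streams per feature.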
Business Value
Enhances the ability to monitor and protect marine ecosystems by providing more accurate and reliable automated detection of marine mammals. This supports conservation efforts, environmental impact assessments, and scientific research.