📄 Abstract
Prompt-learning-based multi-modal trackers have made strong progress by using
lightweight visual adapters to inject auxiliary-modality cues into frozen
foundation models. However, they still underutilize two essentials:
modality-specific frequency structure and long-range temporal dependencies. We
present Learning Frequency and Memory-Aware Prompts, a dual-adapter framework
that injects lightweight prompts into a frozen RGB tracker. A frequency-guided
visual adapter adaptively transfers complementary cues across modalities by
jointly calibrating spatial, channel, and frequency components, narrowing the
modality gap without full fine-tuning. A multilevel memory adapter, with short,
long, and permanent memory, stores, updates, and retrieves reliable temporal
context, enabling consistent propagation across frames and robust recovery from
occlusion, motion blur, and illumination changes. This unified design preserves
the efficiency of prompt learning while strengthening cross-modal interaction
and temporal coherence. Extensive experiments on RGB-Thermal, RGB-Depth, and
RGB-Event benchmarks show consistent state-of-the-art results over fully
fine-tuned and adapter-based baselines, together with favorable parameter
efficiency and runtime. Code and models are available at
https://github.com/xuboyue1999/mmtrack.git.
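To make the idea of calibrating spatial, channel, and frequency components more concrete, here is a minimal NumPy sketch. It is not the authors' adapter: the function name `frequency_prompt`, the fixed `freq_gate` mask, and the sigmoid-of-mean weights are all assumptions chosen for illustration, and a real adapter would learn these weights rather than fix them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frequency_prompt(rgb_feat, aux_feat, freq_gate):
    """Hypothetical sketch: add a frequency-filtered auxiliary prompt to RGB features.

    rgb_feat, aux_feat: arrays of shape (C, H, W).
    freq_gate: real-valued mask of shape (H, W // 2 + 1) that scales each
    frequency bin of the auxiliary features (assumed fixed here, not learned).
    """
    # Frequency calibration: reweight the auxiliary spectrum bin by bin.
    spec = np.fft.rfft2(aux_feat, axes=(-2, -1))
    filtered = np.fft.irfft2(spec * freq_gate, s=aux_feat.shape[-2:], axes=(-2, -1))

    # Channel calibration: one weight per channel from its spatial mean.
    channel_w = sigmoid(filtered.mean(axis=(-2, -1), keepdims=True))  # (C, 1, 1)
    # Spatial calibration: one weight per location from the channel mean.
    spatial_w = sigmoid(filtered.mean(axis=0, keepdims=True))         # (1, H, W)

    # The prompt is added to the RGB features, which are otherwise unchanged.
    return rgb_feat + filtered * channel_w * spatial_w

rng = np.random.default_rng(0)
rgb = rng.normal(size=(4, 8, 8))
aux = rng.normal(size=(4, 8, 8))
out = frequency_prompt(rgb, aux, np.ones((8, 5)))
print(out.shape)
```

With an all-zero `freq_gate` the prompt vanishes and the RGB features pass through unchanged, which mirrors the additive, non-destructive nature of prompt injection into a frozen tracker.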
Key Contributions
This paper introduces a dual-adapter framework for multi-modal object tracking that addresses the underutilization of modality-specific frequency structure and long-range temporal dependencies. It proposes a frequency-guided visual adapter for cross-modal cue transfer and a multilevel memory adapter for robust temporal context propagation, preserving the efficiency of prompt learning while strengthening cross-modal interaction and temporal coherence.
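The three memory levels can be pictured with a small Python sketch. The abstract only names short, long, and permanent memory; the class name `MultiLevelMemory`, the FIFO buffers, and the confidence threshold used to decide what enters long memory are assumptions made here for illustration, not the paper's update rules.

```python
from collections import deque

class MultiLevelMemory:
    """Hypothetical three-level memory: permanent, short, and long."""

    def __init__(self, short_size=3, long_size=5, conf_threshold=0.8):
        self.permanent = None                  # first-frame entry, never overwritten
        self.short = deque(maxlen=short_size)  # most recent frames, always updated
        self.long = deque(maxlen=long_size)    # only frames judged reliable
        self.conf_threshold = conf_threshold   # assumed reliability cutoff

    def update(self, feature, confidence):
        if self.permanent is None:
            self.permanent = feature
        self.short.append(feature)
        if confidence >= self.conf_threshold:
            self.long.append(feature)

    def retrieve(self):
        return {
            "permanent": self.permanent,
            "short": list(self.short),
            "long": list(self.long),
        }

memory = MultiLevelMemory(short_size=2, long_size=2, conf_threshold=0.5)
for name, conf in [("f0", 0.9), ("f1", 0.2), ("f2", 0.7), ("f3", 0.1)]:
    memory.update(name, conf)
print(memory.retrieve())
```

In this toy run the low-confidence frames `f1` and `f3` (standing in for occluded or blurred frames) reach short memory but are kept out of long memory, so a tracker could still fall back on reliable older context after a bad stretch.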
Business Value
More accurate and robust multi-modal tracking can benefit surveillance, autonomous driving, and video content analysis by reducing manual review effort and improving reliability under occlusion, motion blur, and illumination changes.