Abstract
In recent years, test-time adaptive object detection has attracted increasing
attention due to its unique advantages in online domain adaptation, which
aligns more closely with real-world application scenarios. However, existing
approaches heavily rely on source-derived statistical characteristics while
making the strong assumption that the source and target domains share an
identical category space. In this paper, we propose the first foundation
model-powered test-time adaptive object detection method that eliminates the
need for source data entirely and overcomes traditional closed-set limitations.
Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for
vision-language detector-driven test-time adaptation, which incorporates text
and visual prompt tuning to adapt both language and vision representation
spaces on the test data in a parameter-efficient manner. Correspondingly, we
propose a Test-time Warm-start strategy tailored for the visual prompts to
effectively preserve the representation capability of the vision branch.
Furthermore, to guarantee high-quality pseudo-labels in every test batch, we
maintain an Instance Dynamic Memory (IDM) module that stores high-quality
pseudo-labels from previous test samples, and propose two novel
strategies, Memory Enhancement and Memory Hallucination, to leverage IDM's
high-quality instances for enhancing original predictions and hallucinating
images without available pseudo-labels, respectively. Extensive experiments on
cross-corruption and cross-dataset benchmarks demonstrate that our method
consistently outperforms previous state-of-the-art methods, and can adapt to
arbitrary cross-domain and cross-category target data. Code is available at
https://github.com/gaoyingjay/ttaod_foundation.
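The abstract's Multi-modal Prompt-based Mean-Teacher framework tunes only text and visual prompts on the test stream while the vision-language detector stays frozen. The sketch below illustrates that pattern under stated assumptions: `PromptedDetector`, the `text_prompts`/`visual_prompts` hooks, and the `detection_loss` argument are hypothetical stand-ins for the authors' actual interfaces.

```python
# Minimal sketch of prompt-only test-time adaptation with a mean teacher.
# PromptedDetector, the prompt hooks, and detection_loss are hypothetical
# illustrations, not the paper's actual implementation.
import torch
import torch.nn as nn

class PromptedDetector(nn.Module):
    """Frozen vision-language detector; only prompt tokens are trainable."""
    def __init__(self, detector, n_tokens=8, dim=512):
        super().__init__()
        self.detector = detector.eval()
        for p in self.detector.parameters():
            p.requires_grad_(False)                      # backbone stays frozen
        self.text_prompts = nn.Parameter(torch.zeros(n_tokens, dim))
        self.visual_prompts = nn.Parameter(torch.zeros(n_tokens, dim))

    def forward(self, images):
        # Assumed hook: the detector accepts extra prompt tokens on both
        # the language and vision branches.
        return self.detector(images,
                             text_prompts=self.text_prompts,
                             visual_prompts=self.visual_prompts)

@torch.no_grad()
def ema_update(teacher, student, m=0.999):
    """Teacher prompts track an exponential moving average of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        if ps.requires_grad:
            pt.mul_(m).add_(ps, alpha=1.0 - m)

def test_time_step(teacher, student, optimizer, detection_loss,
                   weak, strong, thr=0.5):
    # Teacher produces pseudo-labels on a weakly augmented view;
    # student learns from them on a strongly augmented view.
    with torch.no_grad():
        pseudo = [p for p in teacher(weak) if p["score"] > thr]
    loss = detection_loss(student(strong), pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
```

In this reading, the teacher would be created as a deep copy of the student before the test stream starts, and the optimizer would be built over only `student.text_prompts` and `student.visual_prompts`, which is what makes the adaptation parameter-efficient.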
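The Test-time Warm-start strategy for visual prompts is described only at a high level in the abstract. One plausible reading, sketched below purely as an assumption, is to fit the visual prompts on the first test batch so that prompted image features match the frozen, prompt-free features, which would preserve the vision branch's pretrained representation before adaptation begins. `encode_image` and its prompt argument are assumed interfaces.

```python
import torch
import torch.nn.functional as F

def warm_start_visual_prompts(student, first_images, steps=10, lr=1e-3):
    """Hypothetical warm start: align prompted features with frozen ones
    on the first test batch, so the visual prompts start out 'harmless'."""
    with torch.no_grad():
        target = student.detector.encode_image(first_images)   # prompt-free
    opt = torch.optim.SGD([student.visual_prompts], lr=lr)
    for _ in range(steps):
        feats = student.detector.encode_image(
            first_images, visual_prompts=student.visual_prompts)
        loss = F.mse_loss(feats, target)      # minimize feature drift
        opt.zero_grad()
        loss.backward()
        opt.step()
```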
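Finally, the Instance Dynamic Memory (IDM) keeps high-confidence instances from earlier test batches and reuses them in two ways: Memory Enhancement augments the current predictions, and Memory Hallucination synthesizes pseudo-labels for images whose own predictions were all filtered out. A toy sketch of that bookkeeping follows, assuming CHW tensor images and list-of-dict predictions; the crop/paste helpers and the per-class deque layout are hypothetical, not the paper's design.

```python
import random
from collections import deque

def crop_box(image, box):
    x1, y1, x2, y2 = (int(v) for v in box)
    return image[:, y1:y2, x1:x2].clone()            # CHW tensor crop

def paste_random(image, crop):
    _, h, w = crop.shape
    _, H, W = image.shape
    y = random.randint(0, max(0, H - h))
    x = random.randint(0, max(0, W - w))
    image[:, y:y + h, x:x + w] = crop
    return [x, y, x + w, y + h]                      # pasted box coordinates

class InstanceDynamicMemory:
    """Toy per-class store of high-confidence instance crops."""
    def __init__(self, per_class=16, conf_thr=0.8):
        self.store = {}                              # label -> deque of crops
        self.per_class = per_class
        self.conf_thr = conf_thr

    def update(self, image, preds):
        # Keep only confident instances from the current test sample.
        for p in preds:
            if p["score"] >= self.conf_thr:
                q = self.store.setdefault(
                    p["label"], deque(maxlen=self.per_class))
                q.append(crop_box(image, p["box"]))

    def enhance(self, image, preds):
        """Memory Enhancement (one reading): paste stored crops alongside
        the current pseudo-labels to densify supervision."""
        extra = []
        for label, q in self.store.items():
            if q:
                box = paste_random(image, random.choice(q))
                extra.append({"label": label, "box": box, "score": 1.0})
        return preds + extra

    def hallucinate(self, image):
        """Memory Hallucination: synthesize pseudo-labels for an image
        whose own predictions were all filtered out."""
        return self.enhance(image, [])
```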
Authors (4)
Yingjie Gao
Yanan Zhang
Zhi Cai
Di Huang
Submitted
October 29, 2025
Key Contributions
This paper proposes the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data and overcomes closed-set limitations. It uses a multi-modal prompt-based mean-teacher framework with text and visual prompt tuning for parameter-efficient adaptation, enabling open-world detection.
Business Value
Enables object detection systems to adapt to new, unseen environments and data distributions in real-time without requiring retraining on source data, improving robustness and applicability.