Abstract
In recent years, test-time adaptive object detection has attracted increasing
attention due to its unique advantages in online domain adaptation, which
aligns more closely with real-world application scenarios. However, existing
approaches heavily rely on source-derived statistical characteristics while
making the strong assumption that the source and target domains share an
identical category space. In this paper, we propose the first foundation
model-powered test-time adaptive object detection method that eliminates the
need for source data entirely and overcomes traditional closed-set limitations.
Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for
vision-language detector-driven test-time adaptation, which incorporates text
and visual prompt tuning to adapt both language and vision representation
spaces on the test data in a parameter-efficient manner. Correspondingly, we
propose a Test-time Warm-start strategy tailored for the visual prompts to
effectively preserve the representation capability of the vision branch.
Furthermore, to guarantee high-quality pseudo-labels in every test batch, we
maintain an Instance Dynamic Memory (IDM) module that stores high-quality
pseudo-labels from previous test samples, and propose two novel
strategies, Memory Enhancement and Memory Hallucination, to leverage IDM's
high-quality instances for enhancing original predictions and hallucinating
images without available pseudo-labels, respectively. Extensive experiments on
cross-corruption and cross-dataset benchmarks demonstrate that our method
consistently outperforms previous state-of-the-art methods, and can adapt to
arbitrary cross-domain and cross-category target data. Code is available at
https://github.com/gaoyingjay/ttaod_foundation.
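The abstract's Multi-modal Prompt-based Mean-Teacher framework tunes only text and visual prompts on the test stream while the vision-language detector stays frozen. The sketch below illustrates that pattern under stated assumptions: `PromptedDetector`, the `text_prompts`/`visual_prompts` hooks, and the `detection_loss` argument are hypothetical stand-ins for the authors' actual interfaces.

```python
# Minimal sketch of prompt-only test-time adaptation with a mean teacher.
# PromptedDetector, the prompt hooks, and detection_loss are hypothetical
# illustrations, not the paper's actual implementation.
import torch
import torch.nn as nn

class PromptedDetector(nn.Module):
    """Frozen vision-language detector; only prompt tokens are trainable."""
    def __init__(self, detector, n_tokens=8, dim=512):
        super().__init__()
        self.detector = detector.eval()
        for p in self.detector.parameters():
            p.requires_grad_(False)                      # backbone stays frozen
        self.text_prompts = nn.Parameter(torch.zeros(n_tokens, dim))
        self.visual_prompts = nn.Parameter(torch.zeros(n_tokens, dim))

    def forward(self, images):
        # Assumed hook: the detector accepts extra prompt tokens on both
        # the language and vision branches.
        return self.detector(images,
                             text_prompts=self.text_prompts,
                             visual_prompts=self.visual_prompts)

@torch.no_grad()
def ema_update(teacher, student, m=0.999):
    """Teacher prompts track an exponential moving average of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        if ps.requires_grad:
            pt.mul_(m).add_(ps, alpha=1.0 - m)

def test_time_step(teacher, student, optimizer, detection_loss,
                   weak, strong, thr=0.5):
    # Teacher produces pseudo-labels on a weakly augmented view;
    # student learns from them on a strongly augmented view.
    with torch.no_grad():
        pseudo = [p for p in teacher(weak) if p["score"] > thr]
    loss = detection_loss(student(strong), pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
```

In this reading, the teacher would be created as a deep copy of the student before the test stream starts, and the optimizer would be built over only `student.text_prompts` and `student.visual_prompts`, which is what makes the adaptation parameter-efficient.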
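The Test-time Warm-start strategy for visual prompts is described only at a high level in the abstract. One plausible reading, sketched below purely as an assumption, is to fit the visual prompts on the first test batch so that prompted image features match the frozen, prompt-free features, which would preserve the vision branch's pretrained representation before adaptation begins. `encode_image` and its prompt argument are assumed interfaces.

```python
import torch
import torch.nn.functional as F

def warm_start_visual_prompts(student, first_images, steps=10, lr=1e-3):
    """Hypothetical warm start: align prompted features with frozen ones
    on the first test batch, so the visual prompts start out 'harmless'."""
    with torch.no_grad():
        target = student.detector.encode_image(first_images)   # prompt-free
    opt = torch.optim.SGD([student.visual_prompts], lr=lr)
    for _ in range(steps):
        feats = student.detector.encode_image(
            first_images, visual_prompts=student.visual_prompts)
        loss = F.mse_loss(feats, target)      # minimize feature drift
        opt.zero_grad()
        loss.backward()
        opt.step()
```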
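Finally, the Instance Dynamic Memory (IDM) keeps high-confidence instances from earlier test batches and reuses them in two ways: Memory Enhancement augments the current predictions, and Memory Hallucination synthesizes pseudo-labels for images whose own predictions were all filtered out. A toy sketch of that bookkeeping follows, assuming CHW tensor images and list-of-dict predictions; the crop/paste helpers and the per-class deque layout are hypothetical, not the paper's design.

```python
import random
from collections import deque

def crop_box(image, box):
    x1, y1, x2, y2 = (int(v) for v in box)
    return image[:, y1:y2, x1:x2].clone()            # CHW tensor crop

def paste_random(image, crop):
    _, h, w = crop.shape
    _, H, W = image.shape
    y = random.randint(0, max(0, H - h))
    x = random.randint(0, max(0, W - w))
    image[:, y:y + h, x:x + w] = crop
    return [x, y, x + w, y + h]                      # pasted box coordinates

class InstanceDynamicMemory:
    """Toy per-class store of high-confidence instance crops."""
    def __init__(self, per_class=16, conf_thr=0.8):
        self.store = {}                              # label -> deque of crops
        self.per_class = per_class
        self.conf_thr = conf_thr

    def update(self, image, preds):
        # Keep only confident instances from the current test sample.
        for p in preds:
            if p["score"] >= self.conf_thr:
                q = self.store.setdefault(
                    p["label"], deque(maxlen=self.per_class))
                q.append(crop_box(image, p["box"]))

    def enhance(self, image, preds):
        """Memory Enhancement (one reading): paste stored crops alongside
        the current pseudo-labels to densify supervision."""
        extra = []
        for label, q in self.store.items():
            if q:
                box = paste_random(image, random.choice(q))
                extra.append({"label": label, "box": box, "score": 1.0})
        return preds + extra

    def hallucinate(self, image):
        """Memory Hallucination: synthesize pseudo-labels for an image
        whose own predictions were all filtered out."""
        return self.enhance(image, [])
```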
Authors (4)
Yingjie Gao
Yanan Zhang
Zhi Cai
Di Huang
Submitted
October 29, 2025
Key Contributions
This paper proposes the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data and overcomes closed-set limitations. It uses a multi-modal prompt-based mean-teacher framework with text and visual prompt tuning for parameter-efficient adaptation, enabling open-world detection.
Business Value
Enables object detection systems to adapt to new, unseen environments and data distributions in real-time without requiring retraining on source data, improving robustness and applicability.