📄 Abstract
With the exponential growth of data, traditional object detection methods are
increasingly struggling to handle vast vocabulary object detection tasks
effectively. We analyze two key limitations of classification-based detectors:
positive gradient dilution, where rare positive categories receive insufficient
learning signals, and hard negative gradient dilution, where discriminative
gradients are overwhelmed by numerous easy negatives. To address these
challenges, we propose CQ-DINO, a category query-based object detection
framework that reformulates classification as a contrastive task between object
queries and learnable category queries. Our method introduces image-guided
query selection, which reduces the negative space by adaptively retrieving
top-K relevant categories per image via cross-attention, thereby rebalancing
gradient distributions and facilitating implicit hard example mining.
Furthermore, CQ-DINO flexibly integrates explicit hierarchical category
relationships in structured datasets (e.g., V3Det) or learns implicit category
correlations via self-attention in generic datasets (e.g., COCO). Experiments
demonstrate that CQ-DINO achieves superior performance on the challenging V3Det
benchmark (surpassing previous methods by 2.1% AP) while remaining
competitive on COCO. Our work provides a scalable solution for real-world
detection systems requiring wide category coverage. The code is publicly available at
https://github.com/FireRedTeam/CQ-DINO.
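The two ideas in the abstract, image-guided selection of top-K category queries and classification as a similarity between object queries and category queries, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, and the dot-product relevance score are all assumptions made for illustration, and the real model presumably uses learned projections and multi-layer decoders.

```python
import numpy as np


def select_top_k_categories(image_feats, category_queries, k):
    """Pick the k category queries most relevant to one image.

    image_feats:      (N, D) array of image tokens.
    category_queries: (C, D) array, one learnable query per category.
    Returns an array of k category indices, most relevant first.
    """
    d = image_feats.shape[1]
    # Single-head cross-attention (assumed): category queries attend to image tokens.
    scores = category_queries @ image_feats.T / np.sqrt(d)  # (C, N)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    attended = attn @ image_feats                            # (C, D)
    # Hypothetical relevance score: agreement between each category query
    # and the image content it attended to.
    relevance = np.sum(attended * category_queries, axis=1)  # (C,)
    return np.argsort(-relevance)[:k]


def contrastive_logits(object_queries, selected_category_queries):
    """Similarity logits between object queries (Q, D) and the K kept categories (K, D)."""
    return object_queries @ selected_category_queries.T      # (Q, K)


# Toy usage with random data: 1000 categories reduced to 50 candidates per image.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(64, 32))
category_queries = rng.normal(size=(1000, 32))
object_queries = rng.normal(size=(10, 32))

kept = select_top_k_categories(image_feats, category_queries, k=50)
logits = contrastive_logits(object_queries, category_queries[kept])
print(logits.shape)
```

In this sketch each object query is scored against only the K retained categories rather than the full vocabulary, which is the mechanism the abstract credits with shrinking the negative space.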
Key Contributions
CQ-DINO addresses gradient dilution in vast vocabulary object detection by reformulating classification as a contrastive task between object queries and learnable category queries. It uses image-guided query selection via cross-attention to keep only the top-K relevant categories per image, which rebalances gradients and implicitly mines hard examples. On the V3Det benchmark it surpasses previous methods by 2.1% AP while remaining competitive on COCO.
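A toy calculation can give a feel for why a large vocabulary dilutes the positive gradient. The model below is an assumption for illustration, not the paper's analysis: it supposes a per-category sigmoid classifier trained with binary cross-entropy, one positive category per object, and every category predicted at the same small probability p. Under binary cross-entropy the gradient with respect to a logit is (p - y), so the positive contributes magnitude 1 - p and each negative contributes p.

```python
def positive_gradient_share(num_categories, p=0.01):
    """Fraction of total gradient magnitude that comes from the single positive.

    Assumes sigmoid + binary cross-entropy, one positive category, and a
    uniform predicted probability p for every category (a toy setting).
    """
    positive = 1.0 - p                    # |p - 1| for the positive category
    negatives = p * (num_categories - 1)  # |p - 0| summed over all negatives
    return positive / (positive + negatives)


# Compare a small vocabulary, a hypothetical vast vocabulary, and a top-K subset.
for label, size in [("80 categories", 80),
                    ("13000 categories (hypothetical)", 13000),
                    ("top-K = 100 candidates", 100)]:
    print(f"{label}: positive share = {positive_gradient_share(size):.3f}")
```

In this toy setting the positive's share of the gradient shrinks as the vocabulary grows and recovers when the candidate set is cut back to K categories, which is consistent with the motivation the summary gives for image-guided query selection.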
Business Value
Enables more comprehensive visual understanding in applications with a wide range of objects, such as large-scale image indexing, autonomous navigation, and retail analytics.