
DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications


Abstract

Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33 ms versus 8-16 ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

Authors (6)

Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P

Submitted

October 29, 2025

arXiv Category

cs.CV


Key Contributions

DINO-YOLO improves data efficiency for object detection in specialized domains such as civil engineering by integrating DINOv3 self-supervised features into a YOLOv12 architecture at two points: input preprocessing (P0) and the mid-backbone (P3). On datasets of fewer than 10K images it reports gains of 12.4% to 88.6% while maintaining real-time inference (30-47 FPS).

Business Value

Enables more accurate and efficient monitoring of infrastructure and construction sites with less manual annotation effort, leading to cost savings and improved safety.

Paper Metadata

Innovation Type

Hybrid Architecture / Feature Integration

Deployment Feasibility

High. Sustains real-time inference (30-47 FPS on an NVIDIA RTX 5090) and improves accuracy on small specialized datasets, making it suitable for field deployment.

Limitations Addressed

The scarcity of annotated data in specialized domains like civil engineering, which hinders the performance of traditional supervised object detection models.

Performance Gains

12.4% improvement on Tunnel Segment Crack detection, 13.7% improvement on Construction PPE, 88.6% improvement on KITTI

Technical Tags

Self-Supervised Learning, Object Detection, Data Efficiency, Vision Transformers, YOLO, Civil Engineering, Real-time Inference, Feature Integration, Pre-training, Domain Adaptation

Research Topics

Data-Efficient Learning, Object Detection, Self-Supervised Learning, Computer Vision, Civil Engineering Applications

Methods & Architectures

DINO-YOLO, DINOv3 (Vision Transformer), YOLOv12, Feature Integration (P0, P3), Self-supervised Pre-training

Applications & Tasks

Civil Engineering, Construction, Infrastructure Monitoring, Autonomous Driving, Limited annotated data in specialized domains, Data-efficient object detection, Tunnel Segment Crack Detection, Construction PPE Detection, Object Detection in Civil Engineering

Datasets & Benchmarks

Datasets

KITTI

Benchmarks

• Tunnel Segment Crack detection (648 images): 12.4% improvement
• Construction PPE (1K images): 13.7% improvement
• KITTI (7K images): 88.6% improvement
• Real-time inference (30-47 FPS)
• Medium-scale architectures (DualP0P3): 55.77% mAP@0.5
• Small-scale architectures (Triple Integration): 53.63% mAP@0.5
• 2-4x inference overhead (21-33 ms vs 8-16 ms baseline)

mAP@0.5, FPS
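The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a predicted box counts as a correct detection. The snippet below is a minimal sketch of that threshold test only. It is not the paper's evaluation code, the function names and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration, and a full mAP computation additionally matches predictions to ground truth per class and averages precision over recall levels.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a hit when its IoU with the truth box meets the threshold."""
    return iou(pred, truth) >= threshold


if __name__ == "__main__":
    truth = (10.0, 10.0, 110.0, 110.0)
    close = (20.0, 20.0, 120.0, 120.0)   # shifted by 10 px in x and y
    far = (60.0, 60.0, 160.0, 160.0)     # shifted by 50 px in x and y
    print(is_true_positive(close, truth))  # True  (IoU = 8100 / 11900)
    print(is_true_positive(far, truth))    # False (IoU = 2500 / 17500)
```

With 100 x 100 px boxes, a 10 px shift still leaves an IoU of about 0.68, while a 50 px shift drops it to about 0.14, well below the 0.5 cut-off.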

Related Fields

Machine Learning, Deep Learning, Computer Vision, Civil Engineering

Keywords

Object Detection, Self-Supervised Learning, Data Efficiency, Vision Transformers, YOLO, Civil Engineering, Construction, Infrastructure, Real-time, Pre-training, DINO, Deep Learning

Academic Context

#Data-Efficient Learning, #Object Detection, #Self-Supervised Learning, #Computer Vision, #Civil Engineering Applications

Technology Stack

Frameworks & Libraries

YOLOv12, DINOv3

Commercial Potential

Potential Products

Automated defect detection systems for infrastructure; smart construction site monitoring tools

Target Industries

Civil Engineering, Construction, Infrastructure Management, Automotive

Use Case Examples

Detecting cracks in tunnels; identifying missing safety equipment on construction workers; monitoring construction progress

Competitive Edge

Outperforms traditional supervised methods in data-scarce environments and maintains real-time performance, offering a practical solution for specialized object detection tasks.

Resource Requirements

Compute Needs

Moderate GPU for training. The reported inference speeds (30-47 FPS) were measured on an NVIDIA RTX 5090; performance on edge devices is not reported in the abstract.

Data Requirements

Works with small, specialized datasets for fine-tuning and evaluation: fewer than 10K annotated images, and as few as 648 in the tunnel crack experiment.

Deployment Constraints

Inference is slower than baseline YOLO: the abstract reports a 2-4x overhead (21-33 ms versus 8-16 ms), depending on model scale and integration complexity.
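The latency and frame-rate figures quoted in the abstract (21-33 ms, 30-47 FPS) can be cross-checked with simple arithmetic. The sketch below assumes the latencies are per-frame times for single-image inference, which the abstract does not state explicitly. The helper names are hypothetical, and pairing the range endpoints to compute a slowdown is an assumption for illustration, not a result from the paper.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def slowdown(model_ms: float, baseline_ms: float) -> float:
    """How many times slower the model is than the baseline."""
    return model_ms / baseline_ms


if __name__ == "__main__":
    # Latency ranges quoted in the abstract (fastest, slowest).
    dino_yolo_ms = (21.0, 33.0)
    baseline_ms = (8.0, 16.0)

    for ms in dino_yolo_ms:
        print(f"DINO-YOLO {ms:.0f} ms -> {latency_ms_to_fps(ms):.1f} FPS")
    for ms in baseline_ms:
        print(f"baseline  {ms:.0f} ms -> {latency_ms_to_fps(ms):.1f} FPS")

    # Assumption: fastest pairs with fastest and slowest with slowest.
    for model, base in zip(dino_yolo_ms, baseline_ms):
        print(f"{model:.0f} ms vs {base:.0f} ms -> {slowdown(model, base):.1f}x")
```

Under that per-frame assumption, 33 ms and 21 ms correspond to roughly 30 and 48 frames per second, which is consistent with the 30-47 FPS range reported in the abstract.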

Scalability

Evaluated across five YOLO scales and nine DINOv3 variants, allowing trade-offs between detection performance and computational cost.
