Proto-Former: Unified Facial Landmark Detection by Prototype Transformer

Abstract

Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can be trained on only a single dataset. This limits model generalization across datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototypes). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model's attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function resolves the instability of prototype expert addressing during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: https://github.com/Husk021118/Proto-Former.
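
The abstract mentions selection weights over prototype experts and a loss that constrains them, but gives no formulas. The sketch below is a generic illustration of that idea, not the paper's implementation: it assumes softmax gating over a few prototype vectors and uses a cross-entropy toward one designated expert as a stand-in for a constraint on the selection weights. All names (`mix_prototypes`, `selection_penalty`, `target_expert`) and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_prototypes(prototypes, gate_logits):
    """Blend prototype vectors using softmax selection weights (assumed gating scheme)."""
    weights = softmax(gate_logits)
    dim = len(prototypes[0])
    mixed = [sum(w * p[i] for w, p in zip(weights, prototypes)) for i in range(dim)]
    return weights, mixed

def selection_penalty(weights, target_expert):
    """Hypothetical constraint: cross-entropy pushing weight toward one expert.
    The actual PA loss in the paper may be defined differently."""
    return -math.log(weights[target_expert] + 1e-12)

# Three hypothetical 2-D prototype experts and gating logits favoring expert 0.
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_logits = [2.0, 0.0, 0.0]

weights, mixed = mix_prototypes(prototypes, gate_logits)
penalty = selection_penalty(weights, target_expert=0)
print(weights, mixed, penalty)
```

A penalty of this kind is small when the gate already concentrates on the designated expert and grows as the weight drifts to other experts, which is one simple way to stabilize which expert a dataset is routed to.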

Authors (7)

Shengkai Hu

Haozhe Qi

Jun Wan

Jiaxing Huang

Lefei Zhang

Hang Sun

+1 more

Submitted

October 17, 2025

arXiv Category

cs.CV

Key Contributions

Proto-Former is a unified, adaptive, end-to-end framework for facial landmark detection that enables joint training across multiple datasets with varying landmark definitions. It uses an Adaptive Prototype-Aware Encoder and Progressive Prototype-Aware Decoder to learn dataset-specific facial structural representations (prototypes).
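
To make "joint training across datasets with varying landmark definitions" concrete, here is a minimal sketch of a loss that accepts samples with different landmark counts in one batch. It is an assumption for illustration only: the per-sample mean L1 error, the dataset labels "A" and "B", and the counts 68 and 98 are not taken from the paper.

```python
def mean_l1(pred, target):
    """Mean absolute error over all (x, y) coordinates of one face."""
    assert len(pred) == len(target)
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(pred))

def joint_loss(batch):
    """Average per-sample losses. Normalizing each sample by its own landmark
    count keeps datasets with more landmarks from dominating the batch."""
    losses = [mean_l1(s["pred"], s["target"]) for s in batch]
    return sum(losses) / len(losses)

# Hypothetical mixed batch: one face with 68 landmarks, one with 98.
batch = [
    {"dataset": "A", "pred": [(0.5, 0.5)] * 68, "target": [(0.5, 0.6)] * 68},
    {"dataset": "B", "pred": [(0.2, 0.2)] * 98, "target": [(0.2, 0.2)] * 98},
]

loss = joint_loss(batch)
print(loss)
```

The design choice shown is normalization per sample rather than per landmark, so each face contributes equally regardless of how many points its dataset annotates.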

Business Value

Enables more robust and versatile facial landmark detection systems, crucial for applications like facial animation, emotion recognition, and augmented reality filters.

Paper Metadata

Innovation Type

Architectural Innovation

Deployment Feasibility

Feasible: the framework is end-to-end and requires only standard deep learning infrastructure.

Limitations Addressed

Limitations of single-dataset training, poor model generalization to different datasets, and the inability to develop a unified model for facial landmark detection.

Performance Gains

Improves generalization across different datasets and enables unified training.

Technical Tags

Facial Landmark Detection, Deep Learning, Transformer, Prototype Learning, Unified Framework, Multi-Dataset Training, Adaptive Encoder, Progressive Decoder, Generalization, Facial Structure

Research Topics

Computer Vision, Deep Learning, Facial Analysis, Transformer Networks, Multi-Task Learning

Methods & Architectures

Transformer architecture, Prototype learning, Adaptive Prototype-Aware Encoder (APAE), Progressive Prototype-Aware Decoder (PPAD), Joint training across datasets

Applications & Tasks

Computer Vision, Biometrics, Human-Computer Interaction, Augmented Reality, Virtual Reality, Facial Landmark Detection, Dataset Generalization, Unified Model Training

Related Fields

Machine Learning, Deep Learning, Pattern Recognition, Biometrics

Keywords

Facial Landmark Detection, Transformer, Prototype Learning, Deep Learning, Computer Vision, Multi-Dataset Training, Unified Model, Facial Analysis, Biometrics, APAE, PPAD

Academic Context

Computer Vision, Deep Learning, Facial Analysis, Transformer Networks, Multi-Task Learning

Commercial Potential

Potential Products

Facial analysis SDKs, AR/VR facial tracking software, Emotion recognition systems

Target Industries

Technology, Gaming, Social Media, Automotive (driver monitoring)

Use Case Examples

Accurate tracking of facial features for virtual avatars; real-time emotion detection from facial expressions; enhancing augmented reality experiences

Competitive Edge

Provides a unified framework for facial landmark detection that overcomes single-dataset limitations and improves generalization.

Resource Requirements

Data Requirements

Requires diverse facial landmark datasets for joint training.

Deployment Constraints

Performance may depend on the diversity and quality of training data.

Production Readiness

Maturity Level

Research
