Abstract
Humans can effortlessly count diverse objects by perceiving visual repetition
and structural relationships rather than relying on class identity. However,
most existing counting models fail to replicate this ability; they often
miscount when objects exhibit complex shapes, internal symmetry, or overlapping
components. In this work, we introduce CountFormer, a transformer-based
framework that learns to recognize repetition and structural coherence for
class-agnostic object counting. Built upon the CounTR architecture, our model
replaces its visual encoder with the self-supervised foundation model DINOv2,
which produces richer and spatially consistent feature representations. We
further incorporate positional embedding fusion to preserve geometric
relationships before decoding these features into density maps through a
lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model
achieves performance comparable to current state-of-the-art methods while
demonstrating superior accuracy on structurally intricate or densely packed
scenes. Our findings indicate that integrating foundation models such as DINOv2
enables counting systems to approach human-like structural perception,
advancing toward a truly general and exemplar-free counting paradigm.
Authors (3)
Md Tanvir Hossain
Akif Islam
Mohd Ruhul Ameen
Submitted
October 27, 2025
Key Contributions
CountFormer introduces a transformer-based framework for class-agnostic object counting that learns visual repetition and structural coherence rather than class identity. It replaces the CounTR visual encoder with the DINOv2 foundation model for richer features and incorporates positional embedding fusion, achieving performance comparable to state-of-the-art methods on the FSC-147 benchmark.
Business Value
Enables more robust and versatile object counting systems for applications such as inventory management, traffic monitoring, and autonomous systems, even under challenging visual conditions such as dense packing or overlapping objects.