Generative Point Tracking with Flow Matching

Abstract

Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how our model's generative capabilities can be leveraged to improve point trajectory estimates by utilizing a best-first search strategy on generated samples during inference, guided by the model's own confidence of its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model's ability to capture multi-modality. GenPT is capable of capturing the multi-modality in point trajectories, which translates to state-of-the-art tracking accuracy on occluded points, while maintaining competitive tracking accuracy on visible points compared to extant discriminative point trackers.
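
To make the regression-to-the-mean limitation concrete, here is a minimal toy sketch. It is not taken from the paper: the setup assumes an occluded point that reappears at one of two equally likely x-positions, and the names `modes` and `mse` are hypothetical. Under these assumptions, the single prediction that minimizes mean squared error is the average of the two positions, a location the point never actually occupies, whereas drawing samples can return either position.

```python
import random

# Toy setup (an assumption for illustration): an occluded point reappears
# at x = -1.0 or x = +1.0 with equal probability.
modes = [-1.0, 1.0]
rng = random.Random(0)
targets = [rng.choice(modes) for _ in range(1000)]


def mse(prediction, observations):
    """Mean squared error of one constant prediction against all observations."""
    return sum((prediction - y) ** 2 for y in observations) / len(observations)


# A regressor trained with MSE converges to the sample mean of its targets.
mean_prediction = sum(targets) / len(targets)

# Compare the mean against each true mode used as a single point estimate.
mse_at_mean = mse(mean_prediction, targets)
mse_at_modes = [mse(m, targets) for m in modes]

# A generative model returns samples instead; resampling the data is a
# stand-in here for drawing from a learned bimodal distribution.
samples = [rng.choice(targets) for _ in range(5)]

print(f"mean prediction: {mean_prediction:+.3f}, MSE: {mse_at_mean:.3f}")
print(f"MSE at each mode: {[round(v, 3) for v in mse_at_modes]}")
print(f"generative samples: {samples}")
```

The mean wins on squared error yet sits between the two plausible locations, which is the failure mode the abstract attributes to discriminative trackers under uncertainty.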

Authors (5)

Mattie Tesfaldet

Adam W. Harley

Konstantinos G. Derpanis

Derek Nowrouzezahrai

Christopher Pal

Submitted

October 23, 2025

arXiv Category

cs.CV

Key Contributions

This paper introduces GenPT, a generative framework for point tracking that uses flow matching to model multi-modal trajectories, addressing the tendency of discriminative trackers to regress to a single mean (or mode) under uncertainty. GenPT's flow matching formulation combines iterative refinement, a window-dependent prior for cross-window consistency, and a variance schedule tuned for point coordinates. At inference, a best-first search over generated samples, guided by the model's own confidence, is used to improve trajectory estimates.
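
As a rough illustration of flow matching applied to point coordinates, the sketch below uses the common linear-interpolation formulation. This is an assumption: the paper's window-dependent prior, variance schedule, iterative refinement, and network are not reproduced here. The names `make_training_pair`, `euler_sample`, and `oracle` are hypothetical, and the oracle velocity field merely stands in for a learned model.

```python
import random


def make_training_pair(x1, t, rng, sigma=1.0):
    """Build one flow matching training example for a 2D point x1.

    x0 is drawn from an isotropic Gaussian prior (an assumption; the paper
    describes a window-dependent prior). Returns the interpolated point x_t
    and the velocity target x1 - x0 that a network would regress.
    """
    x0 = [rng.gauss(0.0, sigma) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [12.5, -3.0]  # hypothetical ground-truth pixel coordinate

# Training side: a noisy point and the velocity a network should predict.
x_t, v = make_training_pair(target, t=0.25, rng=rng)


def oracle(x, t):
    """Exact conditional velocity toward `target`; a trained network would
    replace this function at inference time."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


# Sampling side: start from prior noise and follow the velocity field.
sample = euler_sample(oracle, x0=[rng.gauss(0.0, 1.0) for _ in target])
print(sample)
```

With a learned, stochastic prior-to-data mapping, different noise draws can land on different plausible trajectories, which is what makes it possible to search over multiple generated samples.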

Business Value

Enhances the accuracy and robustness of video analysis systems, enabling applications in autonomous driving, surveillance, robotics, and augmented reality.

Paper Metadata

Innovation Type

Algorithmic / Generative Approach

Deployment Feasibility

Moderate. Requires significant computational resources for training and inference. Integration into real-time systems needs optimization.

Limitations Addressed

Discriminative trackers' inability to capture multi-modal trajectories; regression to a single mean in the presence of uncertainty; difficulty in tracking through visual obfuscations (appearance changes, occlusions)

Performance Gains

State-of-the-art tracking accuracy on occluded points, competitive accuracy on visible points, ability to model and represent multi-modal trajectories

Technical Tags

point tracking, video analysis, generative models, flow matching, multi-modal trajectories, uncertainty modeling, occlusion handling, discriminative tracking

Research Topics

Computer Vision, Generative Models, Video Understanding, Trajectory Prediction, Deep Learning

Methods & Architectures

Generative Point Tracker (GenPT), Flow Matching, Iterative Refinement, Window-dependent Prior, Best-first Search, Generative Model, Flow Matching Model

Applications & Tasks

Computer Vision, Video Analysis, Robotics, Augmented Reality, Point Tracking, Trajectory Prediction, Handling Uncertainty, Capturing Multi-modality, Tracking points in videos, Modeling multi-modal trajectories, Improving trajectory estimation through generation

Related Fields

Computer Vision, Machine Learning, Deep Learning, Robotics, Signal Processing

Keywords

Point Tracking, Generative Models, Flow Matching, Video Analysis, Trajectory Prediction, Multi-modal, Uncertainty, Occlusion, Computer Vision, Deep Learning, GenPT

Academic Context

#Computer Vision, #Generative Models, #Video Understanding, #Trajectory Prediction, #Deep Learning

Commercial Potential

Potential Products

Advanced video tracking software, robotics perception modules, AR/VR tracking systems

Target Industries

Technology, Automotive, Security, Entertainment, Robotics

Use Case Examples

Tracking objects or individuals in surveillance footage, enabling robots to follow moving targets, improving motion capture and animation

Competitive Edge

Offers a generative approach to point tracking that explicitly models multi-modal trajectories and uncertainty, achieving state-of-the-art accuracy on occluded points while remaining competitive with discriminative methods on visible points.

Market Opportunity

Significant market in video analytics, robotics, and AR/VR.

Revenue Models

Licensing of algorithms, integration into software platforms.

Resource Requirements

Compute Needs

High compute requirements for training and inference.

Data Requirements

Video datasets with annotated point trajectories.

Deployment Constraints

Real-time performance requirements, computational cost.

Scalability

Scalable with distributed computing resources.

Regulatory Considerations

None.

Production Readiness

Maturity Level

Research

Time to Market

2-5 years for integration into specialized applications.

Patent Potential

Moderate, for the flow matching formulation and GenPT architecture.
