📄 Abstract
Tracking a point through a video can be a challenging task due to uncertainty
arising from visual obfuscations, such as appearance changes and occlusions.
Although current state-of-the-art discriminative models excel in regressing
long-term point trajectory estimates, even through occlusions, they are
limited to regressing to a mean (or mode) in the presence of uncertainty, and
fail to capture multi-modality. To overcome this limitation, we introduce
Generative Point Tracker (GenPT), a generative framework for modelling
multi-modal trajectories. GenPT is trained with a novel flow matching
formulation that combines the iterative refinement of discriminative trackers,
a window-dependent prior for cross-window consistency, and a variance schedule
tuned specifically for point coordinates. We show how our model's generative
capabilities can be leveraged to improve point trajectory estimates by
utilizing a best-first search strategy on generated samples during inference,
guided by the model's own confidence in its predictions. Empirically, we
evaluate GenPT against the current state of the art on the standard
PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a
TAP-Vid variant with additional occlusions to assess occluded point tracking
performance and highlight our model's ability to capture multi-modality. GenPT
is capable of capturing the multi-modality in point trajectories, which
translates to state-of-the-art tracking accuracy on occluded points, while
maintaining competitive tracking accuracy on visible points compared to extant
discriminative point trackers.
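The claim that a regression-based tracker collapses to a mean under uncertainty can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a hypothetical 1-D setting in which an occluded point reappears near either x = -1 or x = +1 with equal probability, and it shows that the single prediction minimising squared error sits between the two modes, where the point almost never is.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy data (an assumption, not from the paper): after an
# occlusion the point reappears near x = -1 or near x = +1.
MODES = (-1.0, 1.0)
samples = [random.choice(MODES) + random.gauss(0.0, 0.05) for _ in range(10_000)]


def mse(prediction, data):
    """Mean squared error of one constant prediction against all samples."""
    return sum((prediction - x) ** 2 for x in data) / len(data)


# The constant that minimises mean squared error is the sample mean.
l2_estimate = statistics.fmean(samples)

# Distance from that estimate to the closest plausible location.
gap_to_nearest_mode = min(abs(l2_estimate - m) for m in MODES)

print(f"L2-optimal estimate: {l2_estimate:+.3f}")
print(f"distance to nearest mode: {gap_to_nearest_mode:.3f}")
print(f"MSE at the mean: {mse(l2_estimate, samples):.3f}")
print(f"MSE at the mode x = +1: {mse(1.0, samples):.3f}")
```

The mean has the lower squared error even though it lies far from both modes, which is why a model trained this way cannot represent both hypotheses. A generative model that samples trajectories can keep them separate.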
Authors (5)
Mattie Tesfaldet
Adam W. Harley
Konstantinos G. Derpanis
Derek Nowrouzezahrai
Christopher Pal
Submitted
October 23, 2025
Key Contributions
This paper introduces GenPT, a generative point-tracking framework that uses flow matching to model multi-modal trajectories, addressing the tendency of discriminative trackers to regress to a single mean (or mode) under uncertainty. Its flow matching formulation combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned for point coordinates. At inference, a best-first search over generated samples, guided by the model's own confidence, improves trajectory estimates.
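The summary does not spell out how the best-first search works, so the following is only a generic sketch of the idea, not the authors' algorithm. It assumes a hypothetical `propose(prefix)` callable that stands in for the generative sampler and returns candidate segments for the next window together with a confidence in (0, 1]. The search always expands the partial trajectory with the highest product of confidences.

```python
import heapq
import itertools
import math


def best_first_track(propose, num_windows):
    """Generic best-first search over per-window candidates.

    `propose(prefix)` is a hypothetical stand-in for a generative sampler:
    it returns a list of (segment, confidence) pairs, confidence in (0, 1].
    Returns the complete trajectory with the highest confidence product.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares prefixes
    frontier = [(0.0, next(tie), ())]  # (negative log-confidence, tie, prefix)
    while frontier:
        cost, _, prefix = heapq.heappop(frontier)
        if len(prefix) == num_windows:
            return list(prefix), math.exp(-cost)
        for segment, confidence in propose(prefix):
            new_cost = cost - math.log(confidence)
            heapq.heappush(frontier, (new_cost, next(tie), prefix + (segment,)))
    return None, 0.0


def toy_propose(prefix):
    """Made-up confidences for illustration only."""
    if not prefix:
        return [("left", 0.6), ("right", 0.4)]
    if prefix[-1] == "left":
        return [("left", 0.5), ("right", 0.5)]
    return [("right", 0.9), ("left", 0.1)]


path, score = best_first_track(toy_propose, num_windows=3)
print(path, round(score, 3))
```

In this toy setup, greedily taking the most confident first segment ("left") leads to a weaker overall trajectory than the one the search returns, which is the motivation for searching over samples rather than committing window by window.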
Business Value
More accurate tracking of occluded points could make video analysis systems more robust, with potential applications in autonomous driving, surveillance, robotics, and augmented reality.