
Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

📄 Abstract

Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. However, existing benchmarks focus on individual subjects rather than scenes with multiple interacting people, so the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground-truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
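The abstract does not name the metrics used to compare reconstructed trajectories against ground truth, but in pedestrian-trajectory research such comparisons are commonly made with average and final displacement error (ADE/FDE). A minimal sketch, with the function name and array layout chosen for illustration:

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Average and final displacement error between matched trajectories.

    pred, gt: arrays of shape (num_agents, num_steps, 2) holding
    bird's-eye-view (x, y) positions for agents already matched one-to-one.
    """
    # Per-agent, per-step Euclidean distance between prediction and truth.
    dists = np.linalg.norm(pred - gt, axis=-1)   # shape (num_agents, num_steps)
    ade = float(dists.mean())                     # averaged over all steps
    fde = float(dists[:, -1].mean())              # error at the final step only
    return ade, fde
```

Note that this assumes generated and ground-truth agents have already been matched (e.g. by a linear assignment on start positions); the matching step is a separate problem.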
Authors: Aaron Appelle, Jerome P. Lynch
Submitted: October 23, 2025
arXiv Category: cs.CV

Key Contributions

This paper proposes a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. It introduces a method to reconstruct 2D bird's-eye view trajectories from pixel space and analyzes the plausibility of multi-agent behavior in generated videos, revealing that leading models have learned effective priors but still exhibit failure modes.
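The paper's exact reconstruction procedure is not detailed here, but recovering ground-plane trajectories from pixel coordinates without camera parameters is typically done via a planar homography (estimated, for example, from point correspondences on the ground plane). A minimal sketch of the projection step, assuming the 3x3 homography `H` has already been estimated:

```python
import numpy as np

def pixels_to_bev(points_px: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Map pixel coordinates to bird's-eye-view ground-plane coordinates.

    points_px: (N, 2) pixel positions, e.g. foot points of detected pedestrians.
    H: 3x3 planar homography from image plane to ground plane, estimated
       beforehand (camera intrinsics/extrinsics are not needed).
    Returns an (N, 2) array of BEV positions.
    """
    ones = np.ones((points_px.shape[0], 1))
    homog = np.hstack([points_px, ones])     # lift to homogeneous coordinates
    mapped = homog @ H.T                     # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]    # perspective divide back to 2D
```

Applying this per frame to each tracked person yields the 2D BEV trajectories that can then be scored against ground truth or checked for multi-agent plausibility.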

Business Value

Enables the development of more realistic and useful video generation models that can serve as simulators for training autonomous systems, testing urban planning scenarios, or creating immersive virtual environments.