
Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

📄 Abstract

Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. However, existing benchmarks focus on individual subjects rather than scenes with multiple interacting people, so the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground-truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
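The abstract does not name the metrics used to compare reconstructed trajectories against ground truth, but in pedestrian-trajectory research such comparisons are commonly made with average and final displacement error (ADE/FDE). A minimal sketch, with the function name and array layout chosen for illustration:

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Average and final displacement error between matched trajectories.

    pred, gt: arrays of shape (num_agents, num_steps, 2) holding
    bird's-eye-view (x, y) positions for agents already matched one-to-one.
    """
    # Per-agent, per-step Euclidean distance between prediction and truth.
    dists = np.linalg.norm(pred - gt, axis=-1)   # shape (num_agents, num_steps)
    ade = float(dists.mean())                     # averaged over all steps
    fde = float(dists[:, -1].mean())              # error at the final step only
    return ade, fde
```

Note that this assumes generated and ground-truth agents have already been matched (e.g. by a linear assignment on start positions); the matching step is a separate problem.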
Authors: Aaron Appelle, Jerome P. Lynch
Submitted: October 23, 2025
arXiv Category: cs.CV

Key Contributions

This paper proposes a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. It introduces a method to reconstruct 2D bird's-eye view trajectories from pixel space and analyzes the plausibility of multi-agent behavior in generated videos, revealing that leading models have learned effective priors but still exhibit failure modes.
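The paper's exact reconstruction procedure is not detailed here, but recovering ground-plane trajectories from pixel coordinates without camera parameters is typically done via a planar homography (estimated, for example, from point correspondences on the ground plane). A minimal sketch of the projection step, assuming the 3x3 homography `H` has already been estimated:

```python
import numpy as np

def pixels_to_bev(points_px: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Map pixel coordinates to bird's-eye-view ground-plane coordinates.

    points_px: (N, 2) pixel positions, e.g. foot points of detected pedestrians.
    H: 3x3 planar homography from image plane to ground plane, estimated
       beforehand (camera intrinsics/extrinsics are not needed).
    Returns an (N, 2) array of BEV positions.
    """
    ones = np.ones((points_px.shape[0], 1))
    homog = np.hstack([points_px, ones])     # lift to homogeneous coordinates
    mapped = homog @ H.T                     # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]    # perspective divide back to 2D
```

Applying this per frame to each tracked person yields the 2D BEV trajectories that can then be scored against ground truth or checked for multi-agent plausibility.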

Business Value

Enables the development of more realistic and useful video generation models that can serve as simulators for training autonomous systems, testing urban planning scenarios, or creating immersive virtual environments.