📄 Abstract
Inspired by the performance and scalability of autoregressive large language
models (LLMs), transformer-based models have seen recent success in the visual
domain. This study investigates a transformer adaptation for video prediction
with a simple end-to-end approach, comparing various spatiotemporal
self-attention layouts. Focusing on causal modeling of physical simulations
over time, a common shortcoming of existing video-generative approaches, we
attempt to isolate spatiotemporal reasoning via physical object tracking
metrics and unsupervised training on physical simulation datasets. We introduce
a simple yet effective pure transformer model for autoregressive video
prediction, utilizing continuous pixel-space representations. Without the
need for complex training strategies or latent
feature-learning components, our approach significantly extends the time
horizon for physically accurate predictions by up to 50% when compared with
existing latent-space approaches, while maintaining comparable performance on
common video quality metrics. In addition, we conduct interpretability
experiments using probing models to identify network regions that encode
information useful for accurate estimation of PDE simulation parameters,
and find that this generalizes to the estimation of out-of-distribution
simulation parameters. This work serves as a platform for further
attention-based spatiotemporal modeling of videos via a simple,
parameter-efficient, and interpretable approach.
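The probing experiments described above can be illustrated with a minimal sketch: a linear probe fit from hidden-layer activations to simulation parameters. This is not the authors' implementation; the feature dimensions, parameter count, and least-squares fit are illustrative assumptions.

```python
import numpy as np

def fit_linear_probe(features, params):
    """Fit a linear probe mapping hidden features to simulation parameters.

    features: (N, D) hidden activations extracted from a network region.
    params:   (N, P) ground-truth simulation parameters (e.g. PDE coefficients).
    Returns the (D+1, P) probe weights, including a bias row.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, params, rcond=None)
    return W

def probe_predict(W, features):
    """Estimate simulation parameters from features using a fitted probe."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ W
```

A probe like this, trained per layer, lets one compare which network regions carry the most parameter-relevant information; out-of-distribution generalization would then be checked by evaluating the fitted probe on parameter values unseen during training.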
Authors
Dean L Slack
G Thomas Hudson
Thomas Winterbottom
Noura Al Moubayed
Submitted
October 23, 2025
IEEE Transactions on Neural Networks and Learning Systems, 36,
19106-19118, 2025
Key Contributions
This paper introduces a pure transformer model for autoregressive video prediction using continuous pixel-space representations. It significantly extends the prediction time horizon for physically accurate predictions by up to 50% compared to existing latent-space methods, by focusing on causal modeling of physical simulations and isolating spatiotemporal reasoning.
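The autoregressive pixel-space rollout described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's architecture: the single causal self-attention layer uses random weights as placeholders for trained parameters, and frames are treated as flattened continuous vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(tokens, rng):
    """One causal self-attention layer over frame tokens.

    tokens: (T, D) continuous pixel-space frame representations.
    Weights are random placeholders standing in for trained parameters.
    """
    T, D = tokens.shape
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(D)
    # causal mask: each frame attends only to itself and earlier frames
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ v

def rollout(frames, n_future, rng):
    """Autoregressively predict n_future frames from an observed context.

    frames: (T, D) flattened context frames; each predicted frame is fed
    back as input for the next step, extending the prediction horizon.
    """
    seq = list(frames)
    for _ in range(n_future):
        out = causal_self_attention(np.stack(seq), rng)
        seq.append(out[-1])  # last position's output serves as the next frame
    return np.stack(seq[len(frames):])
```

The key property this sketch captures is that prediction happens directly in continuous pixel space, with no latent feature-learning component between the transformer and the frames.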
Business Value
Enables more accurate and longer-term predictions in dynamic physical environments, crucial for applications like autonomous driving, robotics, and scientific simulations.