📄 Abstract
Inspired by the performance and scalability of autoregressive large language
models (LLMs), transformer-based models have seen recent success in the visual
domain. This study investigates a transformer adaptation for video prediction
with a simple end-to-end approach, comparing various spatiotemporal
self-attention layouts. Focusing on causal modeling of physical simulations
over time, a common shortcoming of existing video-generative approaches, we
attempt to isolate spatiotemporal reasoning via physical object tracking
metrics and unsupervised training on physical simulation datasets. We introduce
a simple yet effective pure transformer model for autoregressive video
prediction, utilizing continuous pixel-space representations. Without the
need for complex training strategies or latent
feature-learning components, our approach significantly extends the time
horizon for physically accurate predictions by up to 50% when compared with
existing latent-space approaches, while maintaining comparable performance on
common video quality metrics. In addition, we conduct interpretability
experiments using probing models to identify network regions that encode
information useful for accurate estimation of PDE simulation parameters,
and find that this generalizes to the estimation of out-of-distribution
simulation parameters. This work serves as a platform for further
attention-based spatiotemporal modeling of videos via a simple,
parameter-efficient, and interpretable approach.
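The probing experiments described above can be illustrated with a minimal sketch: a linear probe fit from hidden-layer activations to simulation parameters. This is not the authors' implementation; the feature dimensions, parameter count, and least-squares fit are illustrative assumptions.

```python
import numpy as np

def fit_linear_probe(features, params):
    """Fit a linear probe mapping hidden features to simulation parameters.

    features: (N, D) hidden activations extracted from a network region.
    params:   (N, P) ground-truth simulation parameters (e.g. PDE coefficients).
    Returns the (D+1, P) probe weights, including a bias row.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, params, rcond=None)
    return W

def probe_predict(W, features):
    """Estimate simulation parameters from features using a fitted probe."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ W
```

A probe like this, trained per layer, lets one compare which network regions carry the most parameter-relevant information; out-of-distribution generalization would then be checked by evaluating the fitted probe on parameter values unseen during training.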
Authors
Dean L Slack
G Thomas Hudson
Thomas Winterbottom
Noura Al Moubayed
Submitted
October 23, 2025
IEEE Transactions on Neural Networks and Learning Systems, 36,
19106-19118, 2025
Key Contributions
This paper introduces a pure transformer model for autoregressive video prediction using continuous pixel-space representations. It significantly extends the prediction time horizon for physically accurate predictions by up to 50% compared to existing latent-space methods, by focusing on causal modeling of physical simulations and isolating spatiotemporal reasoning.
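The autoregressive pixel-space rollout described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's architecture: the single causal self-attention layer uses random weights as placeholders for trained parameters, and frames are treated as flattened continuous vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(tokens, rng):
    """One causal self-attention layer over frame tokens.

    tokens: (T, D) continuous pixel-space frame representations.
    Weights are random placeholders standing in for trained parameters.
    """
    T, D = tokens.shape
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(D)
    # causal mask: each frame attends only to itself and earlier frames
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ v

def rollout(frames, n_future, rng):
    """Autoregressively predict n_future frames from an observed context.

    frames: (T, D) flattened context frames; each predicted frame is fed
    back as input for the next step, extending the prediction horizon.
    """
    seq = list(frames)
    for _ in range(n_future):
        out = causal_self_attention(np.stack(seq), rng)
        seq.append(out[-1])  # last position's output serves as the next frame
    return np.stack(seq[len(frames):])
```

The key property this sketch captures is that prediction happens directly in continuous pixel space, with no latent feature-learning component between the transformer and the frames.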
Business Value
Enables more accurate and longer-term predictions in dynamic physical environments, crucial for applications like autonomous driving, robotics, and scientific simulations.