📄 Abstract
Recent 3D feed-forward models, such as the Visual Geometry Grounded
Transformer (VGGT), have shown strong capability in inferring 3D attributes of
static scenes. However, since they are typically trained on static datasets,
these models often struggle in real-world scenarios involving complex dynamic
elements, such as moving humans or deformable objects like umbrellas. To
address this limitation, we introduce PAGE-4D, a feedforward model that extends
VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and
point cloud reconstruction -- all without post-processing. A central challenge
in multi-task 4D reconstruction is the inherent conflict between tasks:
accurate camera pose estimation requires suppressing dynamic regions, while
geometry reconstruction requires modeling them. To resolve this tension, we
propose a dynamics-aware aggregator that disentangles static and dynamic
information by predicting a dynamics-aware mask -- suppressing motion cues for
pose estimation while amplifying them for geometry reconstruction. Extensive
experiments show that PAGE-4D consistently outperforms the original VGGT in
dynamic scenarios, achieving superior results in camera pose estimation,
monocular and video depth estimation, and dense point map reconstruction.
Authors (8)
Kaichen Zhou
Yuhan Wang
Grace Chen
Xinhai Chang
Gaspard Beaudouin
Fangneng Zhan
+2 more
Submitted
October 20, 2025
Key Contributions
PAGE-4D extends the feedforward 3D model VGGT to dynamic scenes by introducing a dynamics-aware aggregator that predicts a dynamics-aware mask. The mask suppresses motion cues for camera pose estimation and amplifies them for geometry reconstruction. This lets the model estimate camera pose, depth, and point clouds in scenes with moving or deformable objects without post-processing, addressing a key limitation of models trained on static scenes.
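The abstract describes a predicted mask that suppresses motion cues for pose estimation while amplifying them for geometry reconstruction. The toy NumPy sketch below illustrates only that general idea and is not the paper's aggregator: the function name `split_tokens_by_dynamics`, the `gain` parameter, and the choice to scale token features directly are all assumptions made for illustration, since this summary does not say where or how the mask is applied inside the network.

```python
import numpy as np


def sigmoid(x):
    """Map raw logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def split_tokens_by_dynamics(tokens, mask_logits, gain=1.0):
    """Hypothetical illustration of mask-based disentanglement.

    tokens:      (N, D) array of per-patch features.
    mask_logits: (N,) array; larger values mean "more likely dynamic".
    gain:        assumed amplification strength for dynamic tokens.
    """
    # Soft dynamics mask in (0, 1), broadcast over the feature dimension.
    m = sigmoid(mask_logits)[:, None]

    # Pose branch: down-weight tokens that look dynamic.
    pose_tokens = tokens * (1.0 - m)

    # Geometry branch: up-weight tokens that look dynamic.
    geometry_tokens = tokens * (1.0 + gain * m)
    return pose_tokens, geometry_tokens


# Four tokens: the first two clearly static, the last two clearly dynamic.
tokens = np.ones((4, 3))
mask_logits = np.array([-20.0, -20.0, 20.0, 20.0])
pose_tokens, geometry_tokens = split_tokens_by_dynamics(tokens, mask_logits)
```

With these inputs, the static tokens pass through both branches almost unchanged, while the dynamic tokens are driven toward zero in the pose branch and roughly doubled in the geometry branch. A real model would learn the mask and would likely apply it at a different point in the network.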
Business Value
More robust and accurate 3D perception in dynamic scenes could benefit applications such as autonomous vehicles and robotics, which must understand complex, changing environments.