📄 Abstract
Learning generalizable robotic manipulation policies remains a key challenge
due to the scarcity of diverse real-world training data. While recent
approaches have attempted to mitigate this through self-supervised
representation learning, most either rely on 2D vision pretraining paradigms
such as masked image modeling, which primarily focus on static semantics or
scene geometry, or utilize large-scale video prediction models that emphasize
2D dynamics, thus failing to jointly learn the geometry, semantics, and
dynamics required for effective manipulation. In this paper, we present
DynaRend, a representation learning framework that learns 3D-aware and
dynamics-informed triplane features via masked reconstruction and future
prediction using differentiable volumetric rendering. By pretraining on
multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future
dynamics, and task semantics in a unified triplane representation. The learned
representations can be effectively transferred to downstream robotic
manipulation tasks via action value map prediction. We evaluate DynaRend on two
challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic
experiments, demonstrating substantial improvements in policy success rate,
generalization to environmental perturbations, and real-world applicability
across diverse manipulation tasks.
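
The core ingredients named in the abstract, a triplane feature representation queried through differentiable volumetric rendering, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the class and function names (TriplaneField, render_rays), feature sizes, plane-axis convention, and sampling scheme are assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    """Illustrative triplane field: three axis-aligned feature planes plus a small decoder."""
    def __init__(self, feat_dim=32, res=128):
        super().__init__()
        # Three feature planes (xy, xz, yz) covering a normalized workspace.
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
        # Small MLP decoding aggregated plane features to density and color.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 4),  # (sigma, r, g, b)
        )

    def query(self, pts):
        # pts: (N, 3) in [-1, 1]^3. Project onto each plane and sample bilinearly.
        xy, xz, yz = pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]
        feats = 0.0
        for plane, coords in zip(self.planes, (xy, xz, yz)):
            grid = coords.view(1, -1, 1, 2)                        # (1, N, 1, 2)
            sampled = F.grid_sample(plane[None], grid, align_corners=True)
            feats = feats + sampled.view(plane.shape[0], -1).t()   # (N, C), summed over planes
        out = self.decoder(feats)
        sigma = F.softplus(out[:, :1])    # non-negative density
        rgb = torch.sigmoid(out[:, 1:])   # color in [0, 1]
        return sigma, rgb


def render_rays(field, origins, dirs, n_samples=64, near=0.1, far=2.0):
    """Standard volumetric rendering of color and expected depth along each ray."""
    # origins, dirs: (R, 3). Uniform depth samples along each ray.
    t = torch.linspace(near, far, n_samples, device=origins.device)       # (S,)
    pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]       # (R, S, 3)
    sigma, rgb = field.query(pts.reshape(-1, 3))
    sigma = sigma.view(-1, n_samples)
    rgb = rgb.view(-1, n_samples, 3)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                               # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
    )[:, :-1]
    weights = alpha * trans
    color = (weights[:, :, None] * rgb).sum(dim=1)                        # (R, 3)
    depth = (weights * t[None, :]).sum(dim=1)                             # (R,)
    return color, depth

Because the renderer returns both color and expected depth, rendered views can in principle be supervised directly with the multi-view RGB-D frames mentioned in the abstract; a sketch of how the masked-reconstruction and future-prediction objectives could use it follows the Key Contributions section.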
Authors (6)
Jingyi Tian
Le Wang
Sanping Zhou
Sen Wang
Jiayi Li
Gang Hua
Submitted
October 28, 2025
Key Contributions
DynaRend introduces a novel representation learning framework that jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. This is achieved through masked reconstruction and future prediction using differentiable volumetric rendering on multi-view RGB-D video data, addressing the limitations of 2D vision pretraining and 2D dynamics models for robotic manipulation.
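
To make that training signal concrete, here is one hedged way the masked-reconstruction and future-prediction objectives could be wired around such a renderer. The encoder, dynamics module, batch key names, and loss weighting are all illustrative assumptions and not the paper's exact formulation.

import torch
import torch.nn.functional as F

def pretraining_loss(encoder, dynamics, render_fn, batch, depth_weight=1.0):
    """Masked reconstruction + future prediction via rendering (illustrative sketch).

    encoder:   maps masked multi-view RGB-D at time t to triplane features (assumed).
    dynamics:  predicts the t+1 triplane from the t triplane and an action (assumed).
    render_fn: (planes, ray_origins, ray_dirs) -> (rgb, depth), e.g. a wrapper around
               the volumetric renderer sketched earlier (assumed interface).
    batch:     dict of tensors with masked inputs and ground-truth rays/pixels.
    """
    cur_planes = encoder(batch["masked_views_t"])          # reconstruct current scene
    fut_planes = dynamics(cur_planes, batch["action_t"])   # predict future scene

    loss = torch.zeros((), device=cur_planes.device)
    for planes, key in ((cur_planes, "t"), (fut_planes, "t1")):
        rgb, depth = render_fn(planes, batch[f"ray_o_{key}"], batch[f"ray_d_{key}"])
        loss = loss + F.mse_loss(rgb, batch[f"rgb_{key}"])                      # photometric term
        loss = loss + depth_weight * F.l1_loss(depth, batch[f"depth_{key}"])    # geometric term
    return loss

Supervising both the reconstructed current frame and the predicted next frame against held-out RGB-D views is what would tie geometry, semantics, and dynamics to a single triplane representation in this sketch.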
Business Value
Enables more robust and generalizable robotic manipulation systems by learning richer 3D representations from readily available sensor data, potentially reducing the need for extensive real-world robot training.