📄 Abstract
Learning generalizable robotic manipulation policies remains a key challenge
due to the scarcity of diverse real-world training data. While recent
approaches have attempted to mitigate this through self-supervised
representation learning, most either rely on 2D vision pretraining paradigms
such as masked image modeling, which primarily focus on static semantics or
scene geometry, or utilize large-scale video prediction models that emphasize
2D dynamics, thus failing to jointly learn the geometry, semantics, and
dynamics required for effective manipulation. In this paper, we present
DynaRend, a representation learning framework that learns 3D-aware and
dynamics-informed triplane features via masked reconstruction and future
prediction using differentiable volumetric rendering. By pretraining on
multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future
dynamics, and task semantics in a unified triplane representation. The learned
representations can be effectively transferred to downstream robotic
manipulation tasks via action value map prediction. We evaluate DynaRend on two
challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic
experiments, demonstrating substantial improvements in policy success rate,
generalization to environmental perturbations, and real-world applicability
across diverse manipulation tasks.
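
The core ingredients named in the abstract, a triplane feature representation queried through differentiable volumetric rendering, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the class and function names (TriplaneField, render_rays), feature sizes, plane-axis convention, and sampling scheme are assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    """Illustrative triplane field: three axis-aligned feature planes plus a small decoder."""
    def __init__(self, feat_dim=32, res=128):
        super().__init__()
        # Three feature planes (xy, xz, yz) covering a normalized workspace.
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
        # Small MLP decoding aggregated plane features to density and color.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 4),  # (sigma, r, g, b)
        )

    def query(self, pts):
        # pts: (N, 3) in [-1, 1]^3. Project onto each plane and sample bilinearly.
        xy, xz, yz = pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]
        feats = 0.0
        for plane, coords in zip(self.planes, (xy, xz, yz)):
            grid = coords.view(1, -1, 1, 2)                        # (1, N, 1, 2)
            sampled = F.grid_sample(plane[None], grid, align_corners=True)
            feats = feats + sampled.view(plane.shape[0], -1).t()   # (N, C), summed over planes
        out = self.decoder(feats)
        sigma = F.softplus(out[:, :1])    # non-negative density
        rgb = torch.sigmoid(out[:, 1:])   # color in [0, 1]
        return sigma, rgb


def render_rays(field, origins, dirs, n_samples=64, near=0.1, far=2.0):
    """Standard volumetric rendering of color and expected depth along each ray."""
    # origins, dirs: (R, 3). Uniform depth samples along each ray.
    t = torch.linspace(near, far, n_samples, device=origins.device)       # (S,)
    pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]       # (R, S, 3)
    sigma, rgb = field.query(pts.reshape(-1, 3))
    sigma = sigma.view(-1, n_samples)
    rgb = rgb.view(-1, n_samples, 3)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                               # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
    )[:, :-1]
    weights = alpha * trans
    color = (weights[:, :, None] * rgb).sum(dim=1)                        # (R, 3)
    depth = (weights * t[None, :]).sum(dim=1)                             # (R,)
    return color, depth

Because the renderer returns both color and expected depth, rendered views can in principle be supervised directly with the multi-view RGB-D frames mentioned in the abstract; a sketch of how the masked-reconstruction and future-prediction objectives could use it follows the Key Contributions section.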
Authors (6)
Jingyi Tian
Le Wang
Sanping Zhou
Sen Wang
Jiayi Li
Gang Hua
Submitted
October 28, 2025
Key Contributions
DynaRend introduces a novel representation learning framework that jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. This is achieved through masked reconstruction and future prediction using differentiable volumetric rendering on multi-view RGB-D video data, addressing the limitations of 2D vision pretraining and 2D dynamics models for robotic manipulation.
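
To make that training signal concrete, here is one hedged way the masked-reconstruction and future-prediction objectives could be wired around such a renderer. The encoder, dynamics module, batch key names, and loss weighting are all illustrative assumptions and not the paper's exact formulation.

import torch
import torch.nn.functional as F

def pretraining_loss(encoder, dynamics, render_fn, batch, depth_weight=1.0):
    """Masked reconstruction + future prediction via rendering (illustrative sketch).

    encoder:   maps masked multi-view RGB-D at time t to triplane features (assumed).
    dynamics:  predicts the t+1 triplane from the t triplane and an action (assumed).
    render_fn: (planes, ray_origins, ray_dirs) -> (rgb, depth), e.g. a wrapper around
               the volumetric renderer sketched earlier (assumed interface).
    batch:     dict of tensors with masked inputs and ground-truth rays/pixels.
    """
    cur_planes = encoder(batch["masked_views_t"])          # reconstruct current scene
    fut_planes = dynamics(cur_planes, batch["action_t"])   # predict future scene

    loss = torch.zeros((), device=cur_planes.device)
    for planes, key in ((cur_planes, "t"), (fut_planes, "t1")):
        rgb, depth = render_fn(planes, batch[f"ray_o_{key}"], batch[f"ray_d_{key}"])
        loss = loss + F.mse_loss(rgb, batch[f"rgb_{key}"])                      # photometric term
        loss = loss + depth_weight * F.l1_loss(depth, batch[f"depth_{key}"])    # geometric term
    return loss

Supervising both the reconstructed current frame and the predicted next frame against held-out RGB-D views is what would tie geometry, semantics, and dynamics to a single triplane representation in this sketch.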
Business Value
Enables more robust and generalizable robotic manipulation systems by learning richer 3D representations from readily available sensor data, potentially reducing the need for extensive real-world robot training.