Abstract
Constructing accurate digital twins of articulated objects is essential for robotic simulation training and for building embodied AI world models, yet it has traditionally required painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework built on a 3D multimodal large language model (MLLM). URDF-Anything uses an autoregressive prediction framework over point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point-cloud features, enabling fine-grained part-level segmentation while remaining consistent with the predicted kinematic parameters. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches in geometric segmentation (17% mIoU improvement), kinematic parameter prediction (29% average error reduction), and physical executability (surpassing baselines by 50%). Notably, our method exhibits excellent generalization, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing sim-to-real transfer capability.
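
The abstract describes a [SEG] token whose hidden state interacts directly with point-cloud features to drive part-level segmentation. The page does not include code, so the snippet below is only a minimal sketch of one plausible realization, assuming a LISA-style mask decoder in which each [SEG] hidden state is projected into the point-feature space and compared against per-point features; all module names, tensor shapes, and the dot-product decoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a [SEG]-token mechanism for
# point-cloud part segmentation. Module names, dimensions, and the
# dot-product mask decoder are illustrative assumptions.
import torch
import torch.nn as nn

class SegTokenPointDecoder(nn.Module):
    def __init__(self, llm_dim: int = 4096, point_dim: int = 256):
        super().__init__()
        # Project the LLM hidden state of each [SEG] token into the
        # point-feature space so the two can be compared directly.
        self.seg_proj = nn.Linear(llm_dim, point_dim)

    def forward(self, seg_hidden: torch.Tensor, point_feats: torch.Tensor) -> torch.Tensor:
        """
        seg_hidden:  (num_parts, llm_dim)    hidden states at [SEG] positions,
                     one per part in the autoregressive output.
        point_feats: (num_points, point_dim) per-point features from a 3D encoder.
        Returns per-part, per-point mask logits of shape (num_parts, num_points).
        """
        queries = self.seg_proj(seg_hidden)   # (num_parts, point_dim)
        logits = queries @ point_feats.T      # similarity scores as mask logits
        return logits

# Usage with random tensors standing in for real MLLM / point-encoder outputs.
decoder = SegTokenPointDecoder()
seg_hidden = torch.randn(3, 4096)        # e.g. 3 articulated parts -> 3 [SEG] tokens
point_feats = torch.randn(2048, 256)     # 2048 points from the object point cloud
masks = decoder(seg_hidden, point_feats).sigmoid() > 0.5  # boolean part masks
```

In this sketch, each part emitted in the autoregressive output contributes one [SEG] token, so the segmentation masks and the kinematic parameters originate from the same decoding pass, which is one way to keep the two predictions consistent as the abstract describes.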
Key Contributions
URDF-Anything is an end-to-end framework using a 3D MLLM to automatically reconstruct articulated objects. It jointly optimizes geometric segmentation and kinematic parameter prediction from point-cloud and text inputs, significantly outperforming existing methods in segmentation accuracy.
Business Value
Streamlines the creation of digital twins for robots and virtual environments, accelerating development cycles for embodied AI and robotics simulation. This can reduce costs and time associated with manual 3D modeling.