Abstract: A key challenge in robot manipulation lies in developing policy models with
strong spatial understanding, the ability to reason about 3D geometry, object
relations, and robot embodiment. Existing methods often fall short: 3D point
cloud models lack semantic abstraction, while 2D image encoders struggle with
spatial reasoning. To address this, we propose SEM (Spatial Enhanced
Manipulation model), a novel diffusion-based policy framework that explicitly
enhances spatial understanding from two complementary perspectives. A spatial
enhancer augments visual representations with 3D geometric context, while a
robot state encoder captures embodiment-aware structure through graph-based
modeling of joint dependencies. By integrating these modules, SEM significantly
improves spatial understanding, leading to robust and generalizable
manipulation across diverse tasks and outperforming existing baselines.
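To make the two components concrete, below is a minimal PyTorch sketch of what a spatial enhancer and a graph-based robot state encoder could look like. All module names, dimensions, and wiring here are assumptions for illustration; the abstract does not specify SEM's actual architecture.

```python
# Hypothetical sketch of the two modules named in the abstract; all names,
# shapes, and wiring are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class SpatialEnhancer(nn.Module):
    """Fuses 2D visual tokens with 3D geometric context (assumed design)."""
    def __init__(self, dim=256):
        super().__init__()
        self.geo_proj = nn.Linear(3, dim)  # lift raw 3D points to feature dim
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_tokens, points):
        # visual_tokens: (B, N, dim); points: (B, M, 3)
        geo = self.geo_proj(points)                        # (B, M, dim)
        enhanced, _ = self.attn(visual_tokens, geo, geo)   # cross-attend to geometry
        return visual_tokens + enhanced                    # residual fusion

class RobotStateEncoder(nn.Module):
    """Graph-style encoding of joint states along a serial kinematic chain (assumed)."""
    def __init__(self, num_joints=7, dim=256):
        super().__init__()
        self.joint_embed = nn.Linear(1, dim)
        # Adjacency of a serial chain: each joint is linked to its neighbors.
        adj = torch.eye(num_joints)
        idx = torch.arange(num_joints - 1)
        adj[idx, idx + 1] = 1.0
        adj[idx + 1, idx] = 1.0
        self.register_buffer("adj", adj / adj.sum(-1, keepdim=True))
        self.msg = nn.Linear(dim, dim)

    def forward(self, joint_angles):
        # joint_angles: (B, num_joints)
        h = self.joint_embed(joint_angles.unsqueeze(-1))   # (B, J, dim)
        # One round of message passing over the kinematic adjacency.
        return h + torch.relu(self.msg(self.adj @ h))
```

In a framework like the one described, the outputs of both modules would presumably be concatenated (or cross-attended) to form the conditioning signal for the diffusion policy head that predicts actions; that integration step is likewise an assumption here.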