Abstract: Vision Language Models (VLMs) have recently been adopted in robotics for
their common-sense reasoning capabilities and generalizability. Existing work
has applied VLMs to generate task and motion planning from natural language
instructions and simulate training data for robot learning. In this work, we
explore using a VLM to interpret human demonstration videos and generate robot
task planning. Our method integrates keyframe selection, visual perception, and
VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to
"see" human demonstrations and explain the corresponding plans to the robot
for it to "do". To validate our approach, we collected a set of long-horizon
human videos demonstrating pick-and-place tasks in three diverse categories and
designed a set of metrics to comprehensively benchmark SeeDo against several
baselines, including state-of-the-art video-input VLMs. The experiments
demonstrate SeeDo's superior performance. We further deployed the generated
task plans both in a simulation environment and on a real robot arm.
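To give a concrete picture of the pipeline sketched above (keyframe selection, visual perception, VLM reasoning), the snippet below is a minimal illustrative sketch, not the authors' implementation: the frame-differencing keyframe selector, the `perceive` and `reason_with_vlm` stubs, and the input filename `human_demo.mp4` are all assumptions made for illustration.

```python
"""Minimal sketch of a SeeDo-style pipeline (illustrative only)."""
import cv2  # pip install opencv-python


def select_keyframes(video_path: str, diff_threshold: float = 30.0, max_frames: int = 8):
    """Pick frames whose mean pixel change vs. the last keyframe exceeds a threshold.
    (Assumed heuristic; the paper's actual keyframe-selection method may differ.)"""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while len(keyframes) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            keyframes.append(frame)
            prev_gray = gray
    cap.release()
    return keyframes


def perceive(frame):
    """Placeholder visual-perception step: return a text description of the frame.
    A real system would run an object detector / tracker here."""
    h, w = frame.shape[:2]
    return f"keyframe of size {w}x{h} (object detections would go here)"


def reason_with_vlm(frame_descriptions):
    """Placeholder VLM-reasoning step: turn per-keyframe descriptions into a task plan.
    A real system would send the keyframes plus a prompt to a vision-language model."""
    return [f"step {i + 1}: act on {desc}" for i, desc in enumerate(frame_descriptions)]


if __name__ == "__main__":
    frames = select_keyframes("human_demo.mp4")  # hypothetical demonstration video
    plan = reason_with_vlm([perceive(f) for f in frames])
    print("\n".join(plan))
```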