Abstract: Learning open-vocabulary physical skills for simulated agents presents a
significant challenge in artificial intelligence. Current reinforcement
learning approaches face critical limitations: manually designed rewards lack
scalability across diverse tasks, while demonstration-based methods struggle to
generalize beyond their training distribution. We introduce GROVE, a
generalized reward framework that enables open-vocabulary physical skill
learning without manual engineering or task-specific demonstrations. Our key
insight is that Large Language Models (LLMs) and Vision Language Models (VLMs)
provide complementary guidance: LLMs generate precise physical constraints
capturing task requirements, while VLMs evaluate motion semantics and
naturalness. Through an iterative design process, VLM-based feedback
continuously refines LLM-generated constraints, creating a self-improving
reward system. To bridge the domain gap between simulation and natural images,
we develop Pose2CLIP, a lightweight mapper that efficiently projects agent
poses directly into semantic feature space without computationally expensive
rendering. Extensive experiments across diverse embodiments and learning
paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion
naturalness and 25.7% better task completion scores while training 8.4x faster
than previous methods. These results establish a new foundation for scalable
physical skill acquisition in simulated environments.
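
The abstract describes the iterative reward design loop only at a high level. The sketch below shows one way such a loop could be structured, assuming the LLM and VLM are exposed as plain callables; the names `llm_propose` and `vlm_evaluate` are hypothetical, and policy training and rollout are elided, so this is an illustration of the alternating refinement pattern rather than the paper's implementation.

```python
from typing import Callable, List

def refine_constraints(
    task: str,
    llm_propose: Callable[[str, str], List[str]],   # (task, feedback) -> constraints
    vlm_evaluate: Callable[[str, List[str]], str],  # (task, constraints) -> feedback
    rounds: int = 3,
) -> List[str]:
    """Alternate LLM constraint generation with VLM feedback.

    Each round, the LLM refines its physical constraints conditioned on the
    VLM's critique of motions produced under the previous constraints,
    yielding the self-improving reward described in the abstract.
    """
    feedback = ""
    constraints: List[str] = []
    for _ in range(rounds):
        # LLM proposes/refines precise physical constraints for the task.
        constraints = llm_propose(task, feedback)
        # VLM evaluates motion semantics and naturalness under those
        # constraints (policy training and rollout are elided here).
        feedback = vlm_evaluate(task, constraints)
    return constraints
```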
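Similarly, Pose2CLIP is described only as a lightweight mapper from agent poses into semantic feature space. A minimal sketch under the assumption of a small MLP regressing pose vectors onto CLIP's unit-norm embedding space follows; the dimensions (`pose_dim=69`, `clip_dim=512`) and architecture are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class Pose2CLIP(nn.Module):
    """Hypothetical pose-to-CLIP-feature mapper (assumed MLP architecture)."""

    def __init__(self, pose_dim: int = 69, clip_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # Normalize to unit length so outputs live on CLIP's embedding
        # hypersphere and are directly comparable to text features.
        feat = self.net(pose)
        return feat / feat.norm(dim=-1, keepdim=True)
```

A semantic reward could then be computed as the cosine similarity between the mapped pose feature and the CLIP text embedding of the task prompt, skipping the rendering step entirely, e.g. `reward = (Pose2CLIP()(pose_batch) * text_embedding).sum(dim=-1)`.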