Abstract
Automatically synthesizing dense rewards from natural language descriptions
is a promising paradigm in reinforcement learning (RL), with applications to
sparse reward problems, open-ended exploration, and hierarchical skill design.
Recent works have made promising steps by exploiting the prior knowledge of
large language models (LLMs). However, these approaches suffer from important
limitations: they either do not scale to problems requiring billions of
environment samples, because they need LLM annotations for each observation, or
they require a diverse offline dataset, which may not exist or may be impossible to
collect. In this work, we address these limitations through a combination of
algorithmic and systems-level contributions. We propose ONI, a distributed
architecture that simultaneously learns an RL policy and an intrinsic reward
function using LLM feedback. Our approach annotates the agent's collected
experience via an asynchronous LLM server, which is then distilled into an
intrinsic reward model. We explore a range of algorithmic choices for reward
modeling with varying complexity, including hashing, classification, and
ranking models. Our approach achieves state-of-the-art performance across a
range of challenging tasks from the NetHack Learning Environment, while
removing the need for large offline datasets required by prior work. We make
our code available at https://github.com/facebookresearch/oni.
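The architecture described above separates policy training from LLM annotation: the agent's collected experience is sent to an asynchronous LLM server, and the returned labels are distilled into a cheap intrinsic reward model that the policy can query on every step. The sketch below is a minimal, hedged illustration of that data flow only, not the paper's implementation: a background thread stands in for the LLM server, a hashed bag-of-words featurizer and a tiny classifier stand in for the reward model, and all names (llm_annotate, embed, IntrinsicRewardModel) are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's API): the agent pushes captions
# to an asynchronous annotation queue, a background worker standing in for the
# LLM server labels them, and the labels are distilled into a small intrinsic
# reward model that is cheap to query during RL.
import queue
import random
import threading

import torch
import torch.nn as nn


def llm_annotate(caption: str) -> float:
    """Stand-in for an LLM judging whether a message reflects progress."""
    return float("kill" in caption or "level" in caption)


def embed(caption: str, dim: int = 64) -> torch.Tensor:
    """Hashed bag-of-words features; deliberately simple for the sketch."""
    v = torch.zeros(dim)
    for tok in caption.split():
        v[hash(tok) % dim] += 1.0
    return v


class IntrinsicRewardModel(nn.Module):
    """Tiny classifier distilling LLM labels into a reward function."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)


annotation_queue = queue.Queue()
labeled = []  # (caption, label) pairs returned by the annotator


def annotation_worker():
    # Runs concurrently with policy training, so RL never blocks on the LLM.
    while True:
        caption = annotation_queue.get()
        if caption is None:
            break
        labeled.append((caption, llm_annotate(caption)))


threading.Thread(target=annotation_worker, daemon=True).start()

model = IntrinsicRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

captions = ["you kill the newt", "you hear a door open", "welcome to level 2"]
for step in range(100):
    # 1) The "agent" observes a caption and requests an annotation (non-blocking).
    caption = random.choice(captions)
    annotation_queue.put(caption)

    # 2) Periodically distill accumulated LLM labels into the reward model.
    if labeled and step % 10 == 0:
        texts, ys = zip(*labeled)
        x = torch.stack([embed(t) for t in texts])
        y = torch.tensor(ys).unsqueeze(1)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3) The policy uses the distilled model as a dense intrinsic reward.
    with torch.no_grad():
        intrinsic_reward = torch.sigmoid(model(embed(caption))).item()

annotation_queue.put(None)  # stop the annotation worker
```

Because annotation runs on a separate worker, the training loop never blocks on LLM latency; the reward model is simply refreshed whenever new labels have accumulated, which is the property that lets the approach scale to billions of environment samples.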
Authors (5)
Qinqing Zheng
Mikael Henaff
Amy Zhang
Aditya Grover
Brandon Amos
Submitted
October 30, 2024
Key Contributions
Introduces ONI, a distributed architecture that enables RL agents to learn intrinsic rewards using LLM feedback in a scalable manner. It addresses limitations of prior methods by allowing asynchronous LLM annotation of agent experience, making it suitable for problems requiring billions of environment samples without needing a pre-existing diverse dataset.
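Among the reward-modeling choices named in the abstract (hashing, classification, ranking), the ranking variant can be illustrated with a standard pairwise objective. The snippet below is a hedged sketch, not the paper's exact recipe: it scores two batches of observation features, one the LLM judged more useful and one less useful, and minimizes a Bradley-Terry style loss; the featurization, pairing scheme, and the name RankingRewardModel are assumptions.

```python
# Hedged sketch of a ranking-style reward model trained with a Bradley-Terry
# pairwise loss. Featurization, pairing, and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RankingRewardModel(nn.Module):
    """Scores an observation; higher score = more intrinsically rewarding."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.score(x)


def pairwise_loss(model, preferred, rejected):
    # -log sigmoid(r(preferred) - r(rejected)) pushes the preferred
    # observation's score above the rejected one's.
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()


model = RankingRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: features of observations an LLM ranked as more vs. less useful.
preferred = torch.randn(32, 64)
rejected = torch.randn(32, 64)

for _ in range(50):
    loss = pairwise_loss(model, preferred, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the same score network can be queried per observation to produce a dense intrinsic reward, just as in the classification case.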
Business Value
Accelerates the development of AI agents for complex tasks by automating reward design, making RL applicable to a wider range of real-world problems where defining explicit rewards is difficult.