
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

Abstract

Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment, while removing the need for large offline datasets required by prior work. We make our code available at https://github.com/facebookresearch/oni.
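
The abstract lists three reward-modeling choices of increasing complexity: hashing, classification, and ranking. As a minimal sketch of the classification variant (not the authors' implementation; the class names, dimensions, and training step below are illustrative assumptions), a small network can be trained on binary LLM labels and then queried for dense intrinsic rewards:

```python
# Sketch only: a classification-style intrinsic reward model distilled from
# binary LLM annotations. Names and hyperparameters are illustrative
# assumptions, not taken from the ONI codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicRewardClassifier(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Logit of P("the LLM would label this observation as worth rewarding").
        return self.net(obs).squeeze(-1)

    @torch.no_grad()
    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # The predicted probability serves as a dense intrinsic reward.
        return torch.sigmoid(self(obs))

def distill_step(model, optimizer, obs_batch, llm_labels):
    """One supervised update on a batch of LLM-annotated observations."""
    loss = F.binary_cross_entropy_with_logits(model(obs_batch), llm_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Roughly speaking, the hashing variant replaces this learned model with a lookup of stored LLM labels, while the ranking variant learns from preference comparisons rather than binary labels.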
Authors (5)
Qinqing Zheng
Mikael Henaff
Amy Zhang
Aditya Grover
Brandon Amos
Submitted: October 30, 2024
arXiv Category: cs.LG

Key Contributions

Introduces ONI, a distributed architecture that learns an RL policy and an intrinsic reward function simultaneously from LLM feedback. By annotating the agent's collected experience through an asynchronous LLM server and distilling those annotations into a reward model, it scales to problems requiring billions of environment samples and removes prior methods' need for per-observation LLM queries or a pre-existing diverse offline dataset (a sketch of the annotation pattern follows below).
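
A rough sketch of that asynchronous annotation pattern, under assumed interfaces (the queue layout, the query_llm_server call, and the captioning step are placeholders, not the actual ONI code): the RL loop pushes observation captions to a queue and keeps stepping the environment, while a background worker requests labels from the LLM server and accumulates annotated pairs for distilling the intrinsic reward model.

```python
# Sketch of the asynchronous annotation loop; all names are illustrative.
import queue
import threading

pending = queue.Queue()   # captions awaiting LLM annotation
annotated = []            # (caption, label) pairs for reward-model training
lock = threading.Lock()

def query_llm_server(caption: str) -> int:
    """Placeholder for a request to the remote LLM server; returns a binary label."""
    raise NotImplementedError

def annotation_worker():
    # Runs off the critical path: slow LLM calls never block environment steps.
    while True:
        caption = pending.get()
        label = query_llm_server(caption)
        with lock:
            annotated.append((caption, label))

threading.Thread(target=annotation_worker, daemon=True).start()

# Inside the RL loop (pseudocode):
#   pending.put(caption_of(observation))               # fire-and-forget annotation request
#   r_int = reward_model.intrinsic_reward(obs_tensor)  # dense reward from the distilled model
#   periodically: run a distillation update on a batch drawn from `annotated`
```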

Business Value

Accelerates the development of AI agents for complex tasks by automating reward design, making RL applicable to a wider range of real-world problems where defining explicit rewards is difficult.