Abstract
Automatically synthesizing dense rewards from natural language descriptions
is a promising paradigm in reinforcement learning (RL), with applications to
sparse reward problems, open-ended exploration, and hierarchical skill design.
Recent works have made promising steps by exploiting the prior knowledge of
large language models (LLMs). However, these approaches suffer from important
limitations: they either do not scale to problems requiring billions of
environment samples, because they need LLM annotations for each observation, or
they require a diverse offline dataset, which may not exist or may be impossible to
collect. In this work, we address these limitations through a combination of
algorithmic and systems-level contributions. We propose ONI, a distributed
architecture that simultaneously learns an RL policy and an intrinsic reward
function using LLM feedback. Our approach annotates the agent's collected
experience via an asynchronous LLM server, which is then distilled into an
intrinsic reward model. We explore a range of algorithmic choices for reward
modeling with varying complexity, including hashing, classification, and
ranking models. Our approach achieves state-of-the-art performance across a
range of challenging tasks from the NetHack Learning Environment, while
removing the need for large offline datasets required by prior work. We make
our code available at https://github.com/facebookresearch/oni.
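The architecture described above separates policy training from LLM annotation: the agent's collected experience is sent to an asynchronous LLM server, and the returned labels are distilled into a cheap intrinsic reward model that the policy can query on every step. The sketch below is a minimal, hedged illustration of that data flow only, not the paper's implementation: a background thread stands in for the LLM server, a hashed bag-of-words featurizer and a tiny classifier stand in for the reward model, and all names (llm_annotate, embed, IntrinsicRewardModel) are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's API): the agent pushes captions
# to an asynchronous annotation queue, a background worker standing in for the
# LLM server labels them, and the labels are distilled into a small intrinsic
# reward model that is cheap to query during RL.
import queue
import random
import threading

import torch
import torch.nn as nn


def llm_annotate(caption: str) -> float:
    """Stand-in for an LLM judging whether a message reflects progress."""
    return float("kill" in caption or "level" in caption)


def embed(caption: str, dim: int = 64) -> torch.Tensor:
    """Hashed bag-of-words features; deliberately simple for the sketch."""
    v = torch.zeros(dim)
    for tok in caption.split():
        v[hash(tok) % dim] += 1.0
    return v


class IntrinsicRewardModel(nn.Module):
    """Tiny classifier distilling LLM labels into a reward function."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)


annotation_queue = queue.Queue()
labeled = []  # (caption, label) pairs returned by the annotator


def annotation_worker():
    # Runs concurrently with policy training, so RL never blocks on the LLM.
    while True:
        caption = annotation_queue.get()
        if caption is None:
            break
        labeled.append((caption, llm_annotate(caption)))


threading.Thread(target=annotation_worker, daemon=True).start()

model = IntrinsicRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

captions = ["you kill the newt", "you hear a door open", "welcome to level 2"]
for step in range(100):
    # 1) The "agent" observes a caption and requests an annotation (non-blocking).
    caption = random.choice(captions)
    annotation_queue.put(caption)

    # 2) Periodically distill accumulated LLM labels into the reward model.
    if labeled and step % 10 == 0:
        texts, ys = zip(*labeled)
        x = torch.stack([embed(t) for t in texts])
        y = torch.tensor(ys).unsqueeze(1)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3) The policy uses the distilled model as a dense intrinsic reward.
    with torch.no_grad():
        intrinsic_reward = torch.sigmoid(model(embed(caption))).item()

annotation_queue.put(None)  # stop the annotation worker
```

Because annotation runs on a separate worker, the training loop never blocks on LLM latency; the reward model is simply refreshed whenever new labels have accumulated, which is the property that lets the approach scale to billions of environment samples.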
Authors (5)
Qinqing Zheng
Mikael Henaff
Amy Zhang
Aditya Grover
Brandon Amos
Submitted
October 30, 2024
Key Contributions
Introduces ONI, a distributed architecture that enables RL agents to learn intrinsic rewards using LLM feedback in a scalable manner. It addresses limitations of prior methods by allowing asynchronous LLM annotation of agent experience, making it suitable for problems requiring billions of environment samples without needing a pre-existing diverse dataset.
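Among the reward-modeling choices named in the abstract (hashing, classification, ranking), the ranking variant can be illustrated with a standard pairwise objective. The snippet below is a hedged sketch, not the paper's exact recipe: it scores two batches of observation features, one the LLM judged more useful and one less useful, and minimizes a Bradley-Terry style loss; the featurization, pairing scheme, and the name RankingRewardModel are assumptions.

```python
# Hedged sketch of a ranking-style reward model trained with a Bradley-Terry
# pairwise loss. Featurization, pairing, and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RankingRewardModel(nn.Module):
    """Scores an observation; higher score = more intrinsically rewarding."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.score(x)


def pairwise_loss(model, preferred, rejected):
    # -log sigmoid(r(preferred) - r(rejected)) pushes the preferred
    # observation's score above the rejected one's.
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()


model = RankingRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: features of observations an LLM ranked as more vs. less useful.
preferred = torch.randn(32, 64)
rejected = torch.randn(32, 64)

for _ in range(50):
    loss = pairwise_loss(model, preferred, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the same score network can be queried per observation to produce a dense intrinsic reward, just as in the classification case.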
Business Value
Accelerates the development of AI agents for complex tasks by automating reward design, making RL applicable to a wider range of real-world problems where defining explicit rewards is difficult.