📄 Abstract
Offline goal-conditioned reinforcement learning (GCRL) offers a practical
learning paradigm in which goal-reaching policies are trained from abundant
state-action trajectory datasets without additional environment interaction.
However, offline GCRL still struggles with long-horizon tasks, even with recent
advances that employ hierarchical policy structures, such as HIQL. To identify
the root cause of this challenge, we make two observations. First, the
performance bottleneck stems mainly from the high-level policy's inability to
generate appropriate subgoals. Second, when the high-level policy is learned in
the long-horizon regime, the sign of the advantage estimate frequently becomes
incorrect. We therefore argue that improving the value function to produce a clear
advantage estimate for learning the high-level policy is essential. In this
paper, we propose a simple yet effective solution: Option-aware Temporally
Abstracted value learning, dubbed OTA, which incorporates temporal abstraction
into the temporal-difference learning process. By modifying the value update to
be option-aware, our approach contracts the effective horizon length, enabling
better advantage estimates even in long-horizon regimes. We experimentally show
that the high-level policy learned using the OTA value function achieves strong
performance on complex tasks from OGBench, a recently proposed offline GCRL
benchmark, including maze navigation and visual robotic manipulation
environments.
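The abstract describes OTA as incorporating temporal abstraction into temporal-difference value learning, so that the value update bootstraps over option-level transitions rather than single environment steps. The sketch below illustrates one way such an option-aware update could look, assuming an IQL/HIQL-style expectile regression for offline value learning; the network `GoalValueNet`, the batch field names, and the choice of how rewards are aggregated over an option are illustrative assumptions, and the paper's exact update rule may differ.

```python
import torch
import torch.nn as nn


class GoalValueNet(nn.Module):
    """Goal-conditioned value network V(s, g) (illustrative architecture)."""

    def __init__(self, obs_dim: int, goal_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, goal], dim=-1))


def expectile_loss(diff: torch.Tensor, tau: float) -> torch.Tensor:
    """Asymmetric L2 loss used in IQL-style offline value learning."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def ota_value_update(value_net: nn.Module,
                     target_net: nn.Module,
                     optimizer: torch.optim.Optimizer,
                     batch: dict,
                     gamma: float = 0.99,
                     tau: float = 0.7) -> float:
    """One option-aware TD update (sketch; field names are hypothetical).

    Assumed batch contents:
      obs        : states s_t                              (B, obs_dim)
      option_obs : states k steps ahead, where k is the
                   option length fixed by the data sampler (B, obs_dim)
      goals      : goals g                                 (B, goal_dim)
      rewards    : reward accumulated over the option      (B,)
      masks      : 0.0 once the goal is reached, else 1.0  (B,)
    """
    with torch.no_grad():
        # Bootstrap from the state reached after an entire option instead of
        # s_{t+1}: each TD hop now spans k primitive steps, which contracts
        # the effective horizon of value propagation.
        next_v = target_net(batch["option_obs"], batch["goals"]).squeeze(-1)
        target = batch["rewards"] + gamma * batch["masks"] * next_v

    v = value_net(batch["obs"], batch["goals"]).squeeze(-1)
    loss = expectile_loss(target - v, tau)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each bootstrap step spans a whole option rather than a single primitive action, value information propagates from the goal back to distant states in roughly a factor-of-k fewer TD hops, which is one way to read the "contracted effective horizon" mentioned in the abstract.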
Key Contributions
This paper proposes Option-aware Temporally Abstracted value learning (OTA) to address challenges in offline goal-conditioned reinforcement learning (GCRL) for long-horizon tasks. OTA improves the high-level policy's ability to generate appropriate subgoals and corrects inaccurate advantage estimates by enhancing the value function, leading to better performance in complex sequential decision-making problems.
Business Value
Enables more efficient and effective training of autonomous agents from pre-collected data, reducing the need for costly real-world interaction, particularly for complex tasks.