Abstract
Numerous heuristics and advanced approaches have been proposed for
exploration in different settings for deep reinforcement learning. Noise-based
exploration generally fares well with dense-shaped rewards and bonus-based
exploration with sparse rewards. However, these methods usually require
additional tuning to deal with undesirable reward settings by adjusting
hyperparameters and noise distributions. Rewards that actively discourage
exploration, i.e., with an action cost and no other dense signal to follow, can
pose a major challenge. We propose a novel exploration method, Stable
Error-seeking Exploration (SEE), that is robust across dense, sparse, and
exploration-adverse reward settings. To this endeavor, we revisit the idea of
maximizing the TD-error as a separate objective. Our method introduces three
design choices to mitigate instability caused by far-off-policy learning, the
conflict of interest of maximizing the cumulative TD-error in an episodic
setting, and the non-stationary nature of TD-errors. SEE can be combined with
off-policy algorithms without modifying the optimization pipeline of the
original objective. In our experimental analysis, we show that a Soft Actor-Critic
agent with the addition of SEE performs robustly across three diverse
reward settings in a variety of tasks without hyperparameter adjustments.
Authors (2)
Sebastian Griesbach
Carlo D'Eramo
Reinforcement Learning Journal, vol. 6, 2025, pp. 1140-1157
Key Contributions
This paper introduces Stable Error-seeking Exploration (SEE), a novel exploration method for deep reinforcement learning that is robust across dense, sparse, and exploration-adverse reward settings. By revisiting and stabilizing the maximization of the TD-error as a separate objective, SEE mitigates instability caused by far-off-policy learning, the conflict between maximizing cumulative TD-error and episodic returns, and the non-stationarity of TD-errors, offering a more generalizable exploration strategy that plugs into off-policy algorithms without modifying their original optimization pipeline.
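To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of what "maximizing the TD-error as a separate objective" can look like: a second, exploration-only critic is trained to treat the main critic's absolute TD-error as an intrinsic reward, while the main (e.g., SAC) objective is left untouched. The paper's three stabilizing design choices are not reproduced here, and all names, shapes, and hyperparameters (`QNet`, `gamma`, the batch layout) are assumptions for illustration only.

```python
# Hypothetical sketch of TD-error-seeking exploration; names and details are
# assumptions based on the abstract, not the SEE algorithm as published.
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Simple state-action value network, reused for both critics in this sketch."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def td_error_intrinsic_reward(q_main, q_main_target, batch, gamma=0.99):
    """Absolute TD-error of the main (extrinsic) critic, used as an intrinsic
    reward for the exploration objective. `batch` holds float tensors
    (obs, act, rew, next_obs, next_act, done)."""
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * q_main_target(next_obs, next_act)
        td_error = target - q_main(obs, act)
    return td_error.abs()

def exploration_critic_loss(q_explore, q_explore_target,
                            q_main, q_main_target, batch, gamma=0.99):
    """Train a separate exploration critic whose 'reward' is the main critic's
    absolute TD-error, so a policy maximizing it seeks regions of high error.
    The main objective's optimization pipeline is not modified."""
    obs, act, _, next_obs, next_act, done = batch
    intrinsic = td_error_intrinsic_reward(q_main, q_main_target, batch, gamma)
    with torch.no_grad():
        bootstrap = q_explore_target(next_obs, next_act)
        target = intrinsic + gamma * (1.0 - done) * bootstrap
    return nn.functional.mse_loss(q_explore(obs, act), target)
```

In this reading, exploration is driven toward transitions the extrinsic critic predicts poorly; the abstract's contribution is the set of design choices that keep such an objective stable despite far-off-policy data, episodic resets, and the non-stationarity of TD-errors, which this sketch deliberately omits.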
Business Value
Enables more efficient and reliable training of reinforcement learning agents in complex environments, particularly those with challenging reward structures. This can accelerate progress in areas like robotics and autonomous systems.