Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization

Abstract

Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards, and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this end, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft Actor-Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
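
As a rough illustration of the quantity at the core of the abstract, the sketch below computes a one-step TD-error from a critic and its target network; its magnitude can then serve as a signal for a separate exploration objective. All names, network sizes, and the batch layout (`QNet`, `td_error`) are illustrative assumptions, not the SEE implementation described in the paper.

```python
# Hypothetical sketch: one-step TD-error whose magnitude can drive exploration.
# Architectures and names here are assumptions for illustration only.
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small state-action value network Q(s, a)."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def td_error(q_net, target_q_net, batch, gamma: float = 0.99):
    """delta = r + gamma * (1 - done) * Q'(s', a') - Q(s, a) for a batch of transitions."""
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * target_q_net(next_obs, next_act)
    return target - q_net(obs, act)

# |delta| can be treated as an intrinsic signal for a separate exploration
# policy, while the base agent's own objective is left untouched.
```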
Authors (2)
Sebastian Griesbach
Carlo D'Eramo
Submitted
June 16, 2025
arXiv Category
cs.LG
Reinforcement Learning Journal, vol. 6, 2025, pp. 1140-1157

Key Contributions

This paper introduces Stable Error-seeking Exploration (SEE), a novel exploration method for deep reinforcement learning that is robust across dense, sparse, and exploration-adverse reward settings. By revisiting TD-error maximization as a separate objective and stabilizing it against far-off-policy learning, the conflicting incentives of maximizing cumulative TD-error in an episodic setting, and the non-stationarity of TD-errors, SEE offers a more generalizable exploration strategy that can be combined with off-policy algorithms without modifying their original optimization pipeline.
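
To make the "separate objective that leaves the base learner untouched" idea concrete, the sketch below wraps a hypothetical off-policy agent (e.g., a SAC-style learner) with an exploration actor trained on the absolute TD-error of the base agent's critic. The wrapper class, the `mix_prob` mixing scheme, and the `td_error`/`update` interfaces are assumptions for illustration, not the paper's actual design choices.

```python
# Hypothetical integration sketch: an error-seeking exploration actor beside an
# unmodified off-policy learner. All interfaces here are illustrative assumptions.
import random

class ErrorSeekingWrapper:
    def __init__(self, base_agent, explore_actor, mix_prob: float = 0.5):
        self.base_agent = base_agent        # e.g., a SAC agent, updated as usual
        self.explore_actor = explore_actor  # separate policy trained on |TD-error|
        self.mix_prob = mix_prob            # fraction of exploratory actions

    def act(self, obs):
        # Behaviour policy: mix exploitation with error-seeking exploration.
        if random.random() < self.mix_prob:
            return self.explore_actor.act(obs)
        return self.base_agent.act(obs)

    def update(self, batch):
        # The base agent's optimization pipeline stays exactly as it was.
        info = self.base_agent.update(batch)
        # The exploration actor pursues its own objective: the magnitude of the
        # base critic's TD-error on the same batch (interface assumed here).
        delta = self.base_agent.td_error(batch)
        self.explore_actor.update(batch, intrinsic_reward=delta.abs())
        return info
```

In this sketch, exploration quality depends only on the base critic's TD-errors, so the exploitation objective never needs re-tuning when the reward setting changes; how SEE actually stabilizes this interaction is detailed in the paper.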

Business Value

Enables more efficient and reliable training of reinforcement learning agents in complex environments, particularly those with challenging reward structures. This can accelerate progress in areas like robotics and autonomous systems.