📄 Abstract
Deep reinforcement learning (DRL) agents excel at solving complex
decision-making tasks across various domains. However, they often require a
substantial number of training steps and a vast experience replay buffer,
leading to significant computational and resource demands. To address these
challenges, we introduce a novel theoretical result that brings the
Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that
focus on bounding the counterfactual loss, we establish a causal bound on the
factual loss, which is analogous to the on-policy loss in DRL. This bound is
computed by storing past value network outputs in the experience replay buffer,
effectively utilizing data that is usually discarded. In extensive experiments
across the Atari 2600 and MuJoCo domains, agents such as DQN and SAC trained
with our proposed term achieve up to a 383% higher reward ratio than the same
agents without it and reduce the experience replay buffer size by up to 96%,
significantly improving sample efficiency at negligible cost.
Key Contributions
Introduces a novel theoretical result leveraging the Neyman-Rubin framework to establish a causal bound on the factual loss in DRL, analogous to the on-policy loss. By storing past value network outputs in the replay buffer, this method effectively recycles data, significantly improving reward ratios and reducing buffer size requirements.
Business Value
Significantly reduces the data and computational resources required for training DRL agents, making advanced AI applications more accessible and cost-effective.