Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound

📄 Abstract

Deep reinforcement learning (DRL) agents excel at solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods, which focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. The bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. In extensive experiments across the Atari 2600 and MuJoCo domains, agents such as DQN and SAC augmented with our proposed term achieve up to a 383% higher reward ratio than the same agents without it and reduce the experience replay buffer size by up to 96%, significantly improving sample efficiency at negligible cost.
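
The core mechanism in the abstract, keeping each transition's value network output alongside the transition itself so that otherwise-discarded data can be reused, can be pictured with a minimal replay-buffer sketch. This is not the authors' implementation; the class name `PastValueReplayBuffer` and its field layout are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): a replay buffer that also stores the
# value network's output recorded when each transition was collected, so this
# quantity is available later instead of being discarded.
import random
from collections import deque


class PastValueReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done, past_value):
        # `past_value`: the value network's output for this transition at
        # collection time; standard buffers throw this away.
        self.buffer.append((state, action, reward, next_state, done, past_value))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones, past_values = zip(*batch)
        return states, actions, rewards, next_states, dones, past_values

    def __len__(self):
        return len(self.buffer)
```

At training time, the stored `past_value` entries are the recycled data that a bound-related loss term would consume; the buffer itself is otherwise a standard uniform replay buffer.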
Authors (2)
Tal Fiskus
Uri Shaham
Submitted
July 15, 2025
arXiv Category
cs.LG

Key Contributions

Introduces a novel theoretical result leveraging the Neyman-Rubin framework to establish a causal bound on the factual loss in DRL, analogous to the on-policy loss. By storing past value network outputs in the replay buffer, this method effectively recycles data, significantly improving reward ratios and reducing buffer size requirements.
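
To make concrete where such a recycled-data term could enter training, the sketch below adds a placeholder auxiliary penalty to a standard DQN-style TD loss. The paper's actual causal bound term is not reproduced here; the coefficient `lambda_aux` and the squared-error form are assumptions for illustration only.

```python
# Hypothetical illustration: augmenting a standard TD loss with an auxiliary
# term computed from stored past Q-values sampled from the replay buffer.
import torch
import torch.nn.functional as F


def dqn_loss_with_past_value_term(q_net, target_net, batch, gamma=0.99, lambda_aux=0.1):
    # `batch` holds tensors sampled from a buffer like the sketch above.
    states, actions, rewards, next_states, dones, past_q = batch

    # Standard TD loss on the current Q-network.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * q_next
    td_loss = F.smooth_l1_loss(q_pred, td_target)

    # Placeholder auxiliary term: penalize deviation from the stored past
    # outputs, standing in for the bound-derived term described in the paper.
    aux_loss = F.mse_loss(q_pred, past_q)

    return td_loss + lambda_aux * aux_loss
```

The same pattern would apply to an actor-critic agent such as SAC, with the auxiliary term attached to the critic loss rather than a DQN head.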

Business Value

Significantly reduces the data and computational resources required for training DRL agents, making advanced AI applications more accessible and cost-effective.