📄 Abstract
With the software industry shifting toward a data-driven culture, online A/B testing has become a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off-policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies. OPE is fundamental in reinforcement learning and is crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large-scale real-world datasets that iteratively optimizes code, evaluates the results, and begins new optimization cycles. We collected datasets, established evaluation protocols, implemented OPE baselines on the Open Bandit Pipeline (OBP) [Saito et al., 2021] and Scope-RL [Kiyohara et al., 2023], and developed the two_agent framework, which reduces system complexity while preserving optimization effectiveness. Results show that the two_agent framework achieves 100% reliability and the highest average improvement among positive outcomes (106.7%). Both two_agent and CrewAI reach a 45% success rate, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" that enhance OPE systems, with implications for scaling data-driven decision-making in production.
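To make the evaluation step concrete, the following is a minimal sketch (not the authors' implementation) of the kind of OPE scoring step that an agent-driven optimization cycle would repeatedly modify and re-run, using the Open Bandit Pipeline. The synthetic dataset settings, the uniform-random evaluation policy, and the estimator choices are illustrative assumptions; the GrowthHacker agent loop itself is not shown.

```python
import numpy as np
from obp.dataset import SyntheticBanditDataset
from obp.ope import (
    OffPolicyEvaluation,
    InverseProbabilityWeighting as IPW,
    SelfNormalizedInverseProbabilityWeighting as SNIPW,
)

# Synthetic logged bandit feedback stands in for the real-world logs used in the paper.
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10_000)

# Placeholder evaluation policy: a uniform-random action distribution
# with shape (n_rounds, n_actions, len_list).
n_rounds, n_actions = bandit_feedback["n_rounds"], dataset.n_actions
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)

# Score the evaluation policy with two OPE estimators; an optimization cycle
# would edit the surrounding code and compare these estimates across iterations.
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[IPW(), SNIPW()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
```

In an iterative setup of the kind the abstract describes, the estimates returned by this step would serve as the feedback signal that decides whether a proposed code change is kept or rolled back before the next cycle begins.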
Authors (5)
Jie JW Wu
Ayanda Patrick Herlihy
Ahmad Saleem Mirza
Ali Afoud
Fatemeh Fard
Submitted
November 2, 2025
Key Contributions
GrowthHacker proposes using LLM-based agents to optimize Off-Policy Evaluation (OPE) performance through iterative code modification. It introduces a benchmark for evaluating LLMs on this task, demonstrating their potential to improve offline A/B testing results from logged data and thereby reduce the need for costly and risky online experiments.
Business Value
Enables faster, cheaper, and safer evaluation of new technologies and product features by optimizing offline A/B testing, leading to quicker product iteration and data-driven decision-making.