📄 Abstract
With the software industry shifting toward a data-driven culture, online A/B testing has become a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off-policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies. OPE is fundamental in reinforcement learning and is crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large-scale real-world datasets that iteratively optimizes code, evaluates the results, and begins new optimization cycles. We collected datasets, established evaluation protocols, implemented OPE baselines on the Open Bandit Pipeline (OBP) [Saito et al., 2021] and Scope-RL [Kiyohara et al., 2023], and developed the two_agent framework, which reduces system complexity while preserving optimization effectiveness. Results show that the two_agent framework achieves 100% reliability and the highest average improvement among positive outcomes (106.7%). Both two_agent and CrewAI reach a 45% success rate, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" that enhance OPE systems, with implications for scaling data-driven decision-making in production.
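To make the evaluation step concrete, the following is a minimal sketch (not the authors' implementation) of the kind of OPE scoring step that an agent-driven optimization cycle would repeatedly modify and re-run, using the Open Bandit Pipeline. The synthetic dataset settings, the uniform-random evaluation policy, and the estimator choices are illustrative assumptions; the GrowthHacker agent loop itself is not shown.

```python
import numpy as np
from obp.dataset import SyntheticBanditDataset
from obp.ope import (
    OffPolicyEvaluation,
    InverseProbabilityWeighting as IPW,
    SelfNormalizedInverseProbabilityWeighting as SNIPW,
)

# Synthetic logged bandit feedback stands in for the real-world logs used in the paper.
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10_000)

# Placeholder evaluation policy: a uniform-random action distribution
# with shape (n_rounds, n_actions, len_list).
n_rounds, n_actions = bandit_feedback["n_rounds"], dataset.n_actions
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)

# Score the evaluation policy with two OPE estimators; an optimization cycle
# would edit the surrounding code and compare these estimates across iterations.
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[IPW(), SNIPW()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
```

In an iterative setup of the kind the abstract describes, the estimates returned by this step would serve as the feedback signal that decides whether a proposed code change is kept or rolled back before the next cycle begins.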
Authors (5)
Jie JW Wu
Ayanda Patrick Herlihy
Ahmad Saleem Mirza
Ali Afoud
Fatemeh Fard
Submitted
November 2, 2025
Key Contributions
GrowthHacker proposes using LLM-based agents to optimize Off-Policy Evaluation (OPE) performance through iterative code modification. It introduces a benchmark for evaluating LLMs on this task, demonstrating their potential to improve offline A/B testing results from logged data and thereby reduce the need for costly and risky online experiments.
Business Value
Enables faster, cheaper, and safer evaluation of new technologies and product features by optimizing offline A/B testing, leading to quicker product iteration and data-driven decision-making.