📄 Abstract
Reinforcement Learning (RL) algorithms typically sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. It under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards that leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function.
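As a concrete illustration of what is being optimized (not the paper's exact per-sample reward transformation, which also yields unbiased gradients), the minimal sketch below computes the familiar unbiased pass@k estimate from n binary rewards. The function name and the toy rewards are assumptions for illustration only.

```python
from math import comb

def pass_at_k_estimate(rewards, k):
    """Unbiased estimate of pass@k from n binary rewards (1 = solved).

    Uses the standard combinatorial estimator 1 - C(n-c, k) / C(n, k),
    where c is the number of successful samples among the n attempts.
    """
    n = len(rewards)
    c = sum(1 for r in rewards if r > 0)
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 sampled attempts, 2 successes, estimate pass@3
print(pass_at_k_estimate([0, 1, 0, 0, 1, 0, 0, 0], k=3))  # ~0.643
```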
While previous efforts are restricted to k=n, ours is the first to enable
robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of
trading off pass@1 performance for pass@k gains, our method allows annealing k
during training, optimizing both metrics and often achieving strong pass@1
numbers alongside significant pass@k gains.
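The sketch below shows one plausible way to anneal k during training, decaying it linearly from n toward 1; the schedule shape, function name, and parameters are assumptions for illustration and not the paper's prescribed recipe.

```python
def annealed_k(step, total_steps, n, k_final=1):
    """Linearly anneal the target k from n down to k_final over training.

    Any schedule that keeps 1 <= k <= n at every step could be
    substituted; linear decay is used here purely as an example.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    k = round(n - frac * (n - k_final))
    return max(k_final, min(n, k))

# Example: n=8 samples per problem, 1000 training steps
for step in (0, 250, 500, 750, 1000):
    print(step, annealed_k(step, 1000, n=8))
```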
We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration from prioritizing joint utility over the utility of individual samples.
Authors (2)
Christian Walder
Deep Karkhanis
Key Contributions
Introduces Pass-at-k Policy Optimization (PKPO), a novel reward transformation technique that enables direct optimization of pass@k performance in RL. It provides low-variance unbiased estimators for pass@k and its gradient, allowing standard RL algorithms to effectively tackle harder problems by optimizing for sets of samples rather than isolated ones.
Business Value
Accelerates the development of intelligent agents capable of solving complex real-world tasks, such as robotic manipulation or automated code generation, leading to increased automation and efficiency.