Abstract
Recent studies operationalize self-improvement through coding agents that
edit their own codebases. They grow a tree of self-modifications through
expansion strategies that favor higher software engineering benchmark
performance, assuming that this implies more promising subsequent
self-modifications. However, we identify a mismatch between the agent's
self-improvement potential (metaproductivity) and its coding benchmark
performance, namely the Metaproductivity-Performance Mismatch. Inspired by
Huxley's concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates
the benchmark performances of the descendants of an agent as an indicator of
its potential for self-improvement. We show that, in our self-improving coding
agent development setting, access to the true $\mathrm{CMP}$ is sufficient to
simulate how the Gödel Machine would behave under certain assumptions. We
introduce the Huxley-Gödel Machine (HGM), which, by estimating $\mathrm{CMP}$
and using it as guidance, searches the tree of self-modifications. On SWE-bench
Verified and Polyglot, HGM outperforms prior self-improving coding agent
development methods while using less wall-clock time. Last but not least, HGM
demonstrates strong transfer to other coding datasets and large language
models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and
evaluated on SWE-bench Lite with GPT-5 achieves human-level performance,
matching the best officially checked results of human-engineered coding agents.
Our code is available at https://github.com/metauto-ai/HGM.
Authors (8)
Wenyi Wang
Piotr Piękos
Li Nanbo
Firas Laakom
Yimeng Chen
Mateusz Ostaszewski
+2 more
Submitted
October 24, 2025
Key Contributions
This paper identifies the Metaproductivity-Performance Mismatch (MPM) and proposes the Huxley-Gödel Machine (HGM) to address the limitations of current self-improving coding agents. HGM estimates a metric (CMP) that aggregates the benchmark performances of an agent's descendants. The metric serves as an indicator of the agent's self-improvement potential, which the agent's own coding benchmark score does not capture.
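The idea of scoring an agent by its descendants rather than by its own benchmark result can be illustrated with a small sketch. This is not the paper's implementation: the `AgentNode` class, the `clade_score` function, the choice of a plain mean over strict descendants, and all scores below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class AgentNode:
    """Hypothetical node in a tree of self-modifications."""
    name: str
    score: float  # the agent's own benchmark performance, in [0, 1]
    children: List["AgentNode"] = field(default_factory=list)


def descendants(node: AgentNode) -> Iterator[AgentNode]:
    """Yield every strict descendant of `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def clade_score(node: AgentNode) -> float:
    """Mean benchmark score over the node's descendants.

    Assumption: a plain mean is used as the aggregate, and a node
    without descendants falls back to its own score.
    """
    scores = [d.score for d in descendants(node)]
    return sum(scores) / len(scores) if scores else node.score


# Made-up scores: agent "a" beats agent "b" on the benchmark,
# but the descendants of "b" perform better than those of "a".
a = AgentNode("a", 0.40, [AgentNode("a1", 0.35), AgentNode("a2", 0.30)])
b = AgentNode("b", 0.30, [AgentNode("b1", 0.50), AgentNode("b2", 0.60)])

for agent in (a, b):
    print(agent.name, agent.score, round(clade_score(agent), 3))
```

In this toy tree, a strategy that expands the node with the higher benchmark score would pick `a`, while one guided by the descendant aggregate would pick `b`. That gap is the kind of mismatch the summary describes.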
Business Value
This work could lead to more capable and autonomous AI development tools, accelerating software creation and AI research. It offers a path towards more robust and genuinely intelligent AI systems.