Understanding the Challenges in Iterative Generative Optimization with LLMs
Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng
2026-03-26
Summary
This paper explores a technique called generative optimization, which uses powerful AI language models to automatically improve artifacts such as computer code or instructions by testing them and learning from the results. However, it finds that this method is not yet reliable enough for real-world use.
What's the problem?
The main issue is that setting up generative optimization isn't straightforward. It's not just about having the AI; engineers must make several crucial but often overlooked decisions about *how* the learning loop works. Specifically, they need to decide which artifact the optimizer starts from and is allowed to edit, how much information from each test run the AI should learn from, and how many test runs should be grouped together before the AI updates the artifact. These choices significantly affect whether the optimization succeeds, yet they are rarely discussed in detail.
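To make these three knobs concrete, here is a minimal sketch of a generic generative-optimization loop with each choice exposed as a parameter. This is not the authors' implementation; the function and parameter names (`optimize`, `run_trial`, `propose`, `credit_horizon`, `batch_size`) are hypothetical, and the LLM editor is a stand-in callable.

```python
from typing import Callable, List

def optimize(
    artifact: str,
    run_trial: Callable[[str], List[str]],          # runs the artifact, returns an execution trace
    propose: Callable[[str, List[List[str]]], str], # stand-in for the LLM that rewrites the artifact
    credit_horizon: int = 5,  # keep only the last N trace events per trial (hypothetical default)
    batch_size: int = 2,      # trials grouped into one batch of learning evidence
    updates: int = 3,         # number of LLM update steps
) -> str:
    """Generic learning loop: the starting artifact, the credit horizon,
    and the batch size are the 'hidden' design choices the paper studies."""
    for _ in range(updates):
        batch = []
        for _ in range(batch_size):
            trace = run_trial(artifact)
            batch.append(trace[-credit_horizon:])  # truncate trace to the credit horizon
        artifact = propose(artifact, batch)        # LLM uses batched evidence to edit the artifact
    return artifact

# Toy usage with stub callables (no real LLM or environment):
if __name__ == "__main__":
    run_trial = lambda a: ["setup", "act", "fail"]      # fixed dummy trace
    propose = lambda a, batch: a + "!"                  # dummy "edit": append a character
    print(optimize("seed prompt", run_trial, propose))  # artifact after 3 updates
```

The point of the sketch is that none of the three defaults is obviously right, and the paper's case studies show that varying them can make or break the loop.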
What's the solution?
The researchers investigated how three key design choices – the starting artifact, how much of each test run's data to learn from (the 'credit horizon'), and how test runs are batched together – affect the success of generative optimization. They tested their ideas on different challenges, including improving AI agents in games and solving difficult reasoning problems. By carefully analyzing the results, they showed that these choices are critical and can make or break the optimization process.
Why it matters?
This work is important because it highlights a major obstacle to making generative optimization a practical tool. If we can't easily figure out the best way to set up these learning loops, it will be hard to widely adopt this promising technology. The paper provides practical advice to help engineers make better decisions when implementing generative optimization, ultimately moving the field closer to building truly self-improving AI systems.
Abstract
Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows, or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: What can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.