AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering
Di Zhang
2026-01-09
Summary
This paper introduces a new way to improve large language model (LLM) agents, treating them more like traditional software that goes through a careful release process.
What's the problem?
Currently, improving LLM agents often involves letting them try to fix themselves or running many different versions at once. While this can raise scores, it's hard to understand *why* an agent is improving, whether it's consistently getting better, or what went wrong when it fails. It's like trying to fix a car by randomly swapping parts and hoping for the best: you don't know which change actually helped or hurt.
What's the solution?
The researchers developed a system called AgentDevel. It works by first running the agent and carefully observing what kinds of errors it makes, without looking at the agent's internal code. Then, it identifies the most common failure patterns and creates a single, improved version of the agent based on those observations. Finally, it tests this new version very carefully, making sure it doesn't break anything that used to work. Think of it like a disciplined software release: diagnose the bugs, fix them, test the fix, and only then ship, ensuring nothing essential stops working.
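The diagnosis step described above can be pictured with a minimal sketch. Everything here is an illustrative assumption, not the paper's actual API: the agent is modeled as a plain function, and the symptom labels that an LLM critic would produce from execution traces are hard-coded.

```python
# Hypothetical sketch of the "observe errors, aggregate dominant symptoms"
# step of an AgentDevel-style cycle. Names (run_suite, diagnose) are
# assumptions for illustration only.
from collections import Counter

def run_suite(agent, tasks):
    """Run the agent on every task and record pass/fail outcomes."""
    return {name: agent(inp) == want for name, (inp, want) in tasks.items()}

def diagnose(results, trace_symptoms):
    """Implementation-blind diagnosis: aggregate symptom labels observed
    on failing tasks (from traces, never the agent's internals) and emit
    a spec naming the dominant failure pattern."""
    failing = [trace_symptoms[t] for t, ok in results.items() if not ok]
    if not failing:
        return None
    symptom, count = Counter(failing).most_common(1)[0]
    return {"target_symptom": symptom, "occurrences": count}

# Toy example: an agent that uppercases input but mishandles empty strings.
tasks = {
    "t1": ("abc", "ABC"),
    "t2": ("", "<empty>"),
    "t3": ("", "<empty>"),
}
agent_v1 = lambda s: s.upper()
results = run_suite(agent_v1, tasks)
# In the real pipeline an LLM critic would label each failing trace;
# the labels below are hard-coded stand-ins.
symptoms = {"t2": "empty-input not handled", "t3": "empty-input not handled"}
spec = diagnose(results, symptoms)
print(spec)  # the dominant symptom drives the next release candidate
```

The spec produced here is the auditable artifact: it records what went wrong and how often, independent of how the agent is implemented.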
Why does it matter?
This approach is important because it makes LLM agents more reliable and easier to debug. By focusing on consistent improvement and preventing regressions (where a fix breaks something else), AgentDevel allows developers to build and release these agents as trustworthy software, rather than relying on unpredictable self-improvement methods.
Abstract
Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard-to-audit improvement trajectories, making it difficult to guarantee non-regression or to reason about failures across versions. We reframe agent improvement as release engineering: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce AgentDevel, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes failure appearances without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that prioritizes pass-to-fail regressions and fail-to-pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and emphasizes non-regression as a primary objective. Experiments on execution-heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software.
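Flip-centered gating, as the abstract describes it, compares per-task outcomes between the current release and a release candidate and treats the flips as first-class evidence. The sketch below illustrates the idea; the exact promotion rule (zero pass-to-fail regressions and at least one fail-to-pass fix) is an assumed policy for illustration, not necessarily the paper's precise criterion.

```python
# Hedged sketch of flip-centered gating: bucket each task by how its
# pass/fail outcome changed between the current release and an RC.
def classify_flips(current, candidate):
    """Classify tasks into the four outcome buckets; flips come first."""
    buckets = {"pass_to_fail": [], "fail_to_pass": [],
               "stable_pass": [], "stable_fail": []}
    for task in current:
        before, after = current[task], candidate[task]
        if before and not after:
            buckets["pass_to_fail"].append(task)   # regression
        elif not before and after:
            buckets["fail_to_pass"].append(task)   # fix
        elif before:
            buckets["stable_pass"].append(task)
        else:
            buckets["stable_fail"].append(task)
    return buckets

def promote(buckets):
    """Assumed gate: promote the RC only if it introduces no
    pass->fail regressions and achieves at least one fail->pass fix."""
    return not buckets["pass_to_fail"] and bool(buckets["fail_to_pass"])

# Usage: per-task pass/fail maps for the current release and the RC.
current = {"a": True, "b": False, "c": True, "d": False}
rc      = {"a": True, "b": True,  "c": True, "d": False}
buckets = classify_flips(current, rc)
print(promote(buckets))  # True: one fix ("b"), no regressions
```

Because the gate is defined purely over observed flips, the promotion decision is auditable: every accepted RC carries an explicit list of which tasks it fixed and a guarantee that none regressed.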