CodeTracer: Towards Traceable Agent States

Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu

2026-04-14

Summary

This paper introduces a new system called CodeTracer designed to help developers understand and fix problems with AI agents that write code. These agents are getting more complex, making it hard to figure out *why* they make mistakes.

What's the problem?

As AI coding agents become more sophisticated, they're doing things like running multiple tools at once and following complicated steps to complete tasks. This makes it really difficult to track what the agent is doing and where things go wrong. When an agent makes a small mistake early on, it can lead to a chain of errors that are hard to diagnose because you can't easily see the agent's thought process or how one error caused another. Current methods for debugging these agents either don't work well with complex tasks or require a lot of manual effort, meaning they don't scale up to real-world coding projects.

What's the solution?

The researchers created CodeTracer, which works by carefully collecting all the information generated when an agent runs – things like the code it writes, the commands it uses, and the results it gets. It then organizes this information into a detailed, step-by-step record of everything the agent did, like a family tree of actions. CodeTracer can then pinpoint exactly where the agent first started to go wrong and trace how that initial error affected everything that followed. They also built a dataset, CodeTraceBench, with lots of examples of agent runs to test and improve CodeTracer.
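To make the idea of a "family tree of actions" concrete, here is a minimal sketch of a hierarchical trace tree with failure onset localization. This is an illustrative toy, not the actual CodeTracer implementation: the `TraceNode` class, the depth-first walk, and the toy run are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical trace-tree node: one agent step (a tool call, code edit,
# or shell command) plus whether that step failed and its child steps.
@dataclass
class TraceNode:
    name: str
    failed: bool = False
    children: list["TraceNode"] = field(default_factory=list)

def failure_onset(node: TraceNode) -> Optional[TraceNode]:
    """Return the earliest failing step in a depth-first (chronological) walk."""
    if node.failed:
        return node
    for child in node.children:
        hit = failure_onset(child)
        if hit is not None:
            return hit
    return None

def downstream_chain(root: TraceNode, onset: TraceNode) -> list[str]:
    """Collect every failing step at or after the onset, in execution order."""
    order: list[TraceNode] = []
    def walk(n: TraceNode) -> None:
        order.append(n)
        for c in n.children:
            walk(c)
    walk(root)
    start = order.index(onset)
    return [n.name for n in order[start:] if n.failed]

# Toy run: an early misstep ("wrong file edit") cascades into later failures.
run = TraceNode("run", children=[
    TraceNode("read repo"),
    TraceNode("wrong file edit", failed=True, children=[
        TraceNode("run tests", failed=True),
    ]),
    TraceNode("retry patch", failed=True),
])

onset = failure_onset(run)
print(onset.name)                    # → wrong file edit
print(downstream_chain(run, onset))  # → ['wrong file edit', 'run tests', 'retry patch']
```

The key point the sketch illustrates: once the run is reconstructed as a tree of steps, finding the *first* failure and the chain it triggered becomes a simple traversal, rather than manual log reading.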

Why it matters?

This work is important because it addresses a major challenge in the development of AI coding assistants. If we can't effectively debug these agents, it will be hard to build reliable and trustworthy tools that can help programmers. CodeTracer provides a way to understand *why* these agents fail, which is crucial for improving their performance and making them more useful in real-world software development.

Abstract

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult: as frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.