Agentic Refactoring: An Empirical Study of AI Coding Agents
Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, Ahmed E. Hassan
2025-11-13
Summary
This paper investigates how AI coding tools are being used to automatically improve existing code, a process called refactoring. It examines over 15,000 real-world examples to understand what kinds of changes these AI agents make and whether those changes actually improve the code.
What's the problem?
While AI tools are becoming popular for refactoring code, there wasn't much research on *how* developers are actually using them, how the AI's approach compares to human-driven refactoring, and whether the AI actually makes the code better. Essentially, we didn't know if these AI tools were genuinely helpful or just changing things arbitrarily.
What's the solution?
Researchers analyzed over 15,000 instances of code changes made by AI agents in open-source Java projects. They looked at what types of refactoring the AI was performing, such as renaming variables or changing data types, and what reasons the AI gave for making those changes. They also used code quality metrics to see if the AI's changes actually improved the code's structure and complexity.
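To make the refactoring types concrete, here is a minimal, hypothetical Java sketch (not taken from the paper's dataset) showing the two kinds of low-level edits mentioned above, Change Variable Type and Rename Variable, applied to the same method. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RefactorExample {
    // Before: concrete collection type and vague variable names.
    static int sumBefore(ArrayList<Integer> l) {
        int s = 0;
        for (int v : l) s += v;
        return s;
    }

    // After: Change Variable Type (ArrayList -> the List interface)
    // and Rename Variable (l -> values, s -> total).
    // The observable behavior is unchanged, the defining property of refactoring.
    static int sumAfter(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3);
        // Both versions compute the same result.
        if (sumBefore(new ArrayList<>(data)) != sumAfter(data)) {
            throw new AssertionError("behavior changed");
        }
        System.out.println(sumAfter(data)); // prints 6
    }
}
```

Edits like these improve consistency and readability of a single method, in contrast to high-level design refactorings (e.g., extracting a class) that the study finds are more typical of human developers.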
Why it matters?
This research is important because it provides the first large-scale look at how AI is impacting software development. It shows that AI agents are frequently used for refactoring, but they tend to focus on small, localized improvements like fixing inconsistencies rather than making big design changes. It also shows that these AI-driven refactorings do lead to small, measurable improvements in code quality, which suggests these tools can be valuable for developers.
Abstract
Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks. Agents have become active participants in refactoring, a cornerstone of sustainable software development aimed at improving internal code quality without altering observable behavior. Despite their increasing adoption, there is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human-driven refactoring, and what impact it has on code quality. To address this empirical gap, we present a large-scale study of AI agent-generated refactorings in real-world open-source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,988 commits derived from the AIDev dataset. Our empirical analysis shows that refactoring is a common and intentional activity in this development paradigm, with agents explicitly targeting refactoring in 26.1% of commits. Analysis of refactoring types reveals that agentic efforts are dominated by low-level, consistency-oriented edits, such as Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%), reflecting a preference for localized improvements over the high-level design changes common in human refactoring. Additionally, the motivations behind agentic refactoring focus overwhelmingly on internal quality concerns, led by maintainability (52.5%) and readability (28.1%). Furthermore, quantitative evaluation of code quality metrics shows that agentic refactoring yields small but statistically significant improvements in structural metrics, particularly for medium-level changes, reducing class size and complexity (e.g., Class LOC median Δ = -15.25).