Code Aesthetics with Agentic Reward Feedback
Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
2025-10-28
Summary
This paper focuses on improving how Large Language Models (LLMs) write code: specifically, making the generated code cleaner and better organized visually, not just functionally correct.
What's the problem?
LLMs are very good at generating code that *functions* correctly, meaning it does what it's supposed to do. However, the code they produce often looks messy and is poorly formatted, making it harder for humans to read and understand. In short, LLMs prioritize making code work over making it aesthetically pleasing.
What's the solution?
The researchers built a system with a few key parts. First, they constructed AesCode-358K, a large dataset of code examples focused on good code style. Then, they designed agentic reward feedback: multiple 'agents' automatically evaluate code, checking whether it runs (executability), how it looks statically (like spacing and layout), and how it looks when you interact with it. Finally, they used these evaluations in a training algorithm called GRPO-AR to teach an LLM to write code that is both functional *and* visually appealing. They also created a new benchmark, OpenDesign, to measure code aesthetics.
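To make the reward idea concrete, here is a minimal sketch of how the three agent signals (executability, static aesthetics, interactive aesthetics) might be folded into one scalar reward for reinforcement learning. The weights, the gating rule, and the class name are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class AestheticReward:
    """Combine the three agent signals into a single scalar reward.

    Assumption: aesthetics only matter if the code runs, so
    executability acts as a hard gate; the two aesthetic scores
    (each assumed to lie in [0, 1]) are then blended by weight.
    """
    w_static: float = 0.5       # weight for static aesthetics (spacing, layout)
    w_interactive: float = 0.5  # weight for interactive aesthetics (rendered behavior)

    def __call__(self, executable: bool, static_score: float,
                 interactive_score: float) -> float:
        # Code that fails to execute gets zero reward regardless of its looks.
        if not executable:
            return 0.0
        return self.w_static * static_score + self.w_interactive * interactive_score


reward_fn = AestheticReward()
reward_fn(True, 0.8, 0.6)   # weighted blend of the two aesthetic scores
reward_fn(False, 1.0, 1.0)  # non-executable code is zeroed out
```

In a GRPO-style setup, this scalar would score each sampled completion in a group, so that policy updates favor code that both runs and looks good.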
Why it matters?
This work is important because well-formatted code is crucial for collaboration and long-term maintenance. If code is easy to read, it's easier to find and fix bugs, and easier for other developers to contribute. The researchers showed that their model, AesCoder-4B, produces code as good as, or better than, much larger models such as GPT-4o and GPT-4.1, and comparable to open-source models with 480B-685B parameters, demonstrating that explicitly training for aesthetics can significantly improve LLM code generation.
Abstract
Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.