
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Jian Yang, Xianglong Liu, Weifeng Lv, Ken Deng, Shawn Guo, Lin Jing, Yizhi Li, Shark Liu, Xianzhen Luo, Yuyu Luo, Changzai Pan, Ensheng Shi, Yingshui Tan, Renshuai Tao, Jiajun Wu, Xianjie Wu, Zhenhe Wu, Daoguang Zan, Chenchen Zhang, Wei Zhang, He Zhu, Terry Yue Zhuo

2025-12-02


Summary

This paper is a deep dive into how large language models (LLMs) are used to write code automatically. It traces how these models have improved dramatically in recent years, going from single-digit success rates to over 95% on benchmarks like HumanEval, and walks through the entire process of building and improving these code-writing AI systems.
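Those success rates are usually reported as pass@k: the probability that at least one of k sampled generations passes the benchmark's unit tests. A minimal sketch of the standard unbiased estimator (compute it from n samples, c of which are correct, rather than literally drawing k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total generations sampled per problem
    c: how many of those n passed the unit tests
    k: budget of samples the metric assumes
    Returns the probability that at least one of k draws
    (without replacement) from the n samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5. Averaging this per-problem value over a benchmark gives the headline number.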

What's the problem?

While LLMs are getting really good at writing code, there's a gap between what researchers test in labs and what actually matters when developers build real-world software: making sure the code is correct and secure, understanding large existing codebases, and fitting into how developers already work. The paper argues that academic research needs to be better connected to the practical needs of software development.

What's the solution?

The authors thoroughly examined the whole lifecycle of building these code LLMs, starting with the data used to train them, then how they're fine-tuned and improved using different techniques like reinforcement learning. They compared several popular LLMs – both general ones like GPT-4 and those specifically designed for code – and ran experiments to figure out what works best for pre-training, fine-tuning, and reinforcement learning, looking at things like how much data is needed and which settings are most important.
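One common ingredient in the reinforcement-learning stage of this lifecycle is an execution-based reward: a generated solution is run against unit tests and scored 1 if everything passes, 0 otherwise. A toy illustrative sketch of that idea (not the paper's actual reward function; real systems sandbox execution and enforce time and memory limits):

```python
def execution_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if the candidate code passes its unit tests, else 0.0.

    Toy version: exec() both strings in a fresh namespace and treat
    any exception (syntax error, failed assert, runtime error) as
    a zero reward.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # run assertions against them
        return 1.0
    except Exception:
        return 0.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
```

Here `execution_reward(good, tests)` yields 1.0 and `execution_reward(bad, tests)` yields 0.0; a binary signal like this is what a policy-gradient method would then optimize.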

Why it matters?

This work is important because it provides a comprehensive guide for both researchers and developers working with code LLMs. It helps bridge the gap between theory and practice, pointing out where more research is needed to make these tools even more useful and reliable for building real-world software. Ultimately, it aims to accelerate the adoption of AI-powered coding assistants and improve the software development process.

Abstract

Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling laws, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.