CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen
2025-02-28
Summary
This paper introduces CODESYNC, a new tool designed to help large language models (LLMs) keep up with changes in programming languages and libraries. It also introduces CODESYNCBENCH, a benchmark for testing how well these models adapt to such changes.
What's the problem?
LLMs are strong at coding tasks, but they struggle to keep up with frequent changes in programming libraries. Because they are trained on static, often outdated data, they can produce code that no longer runs, or that is less safe and efficient than current APIs allow.
What's the solution?
The researchers created CODESYNC, a data engine that finds outdated code patterns and collects up-to-date API information from Python third-party libraries. Building on it, they made CODESYNCBENCH, a benchmark of 3,300 test cases covering real-world updates to 220 APIs across six Python libraries, which measures how well models adapt to code changes. They evaluated 14 state-of-the-art LLMs and found that even the best ones have trouble keeping up with these updates.
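To make the idea of "finding outdated code patterns" concrete, here is a minimal sketch of how such a check might work. This is not CODESYNC's actual pipeline: the deprecated-API mapping below is hypothetical, and the real system mines updates from library histories rather than using a hand-written table.

```python
import ast

# Hypothetical mapping of outdated calls to suggested replacements.
# (Illustrative only; CODESYNC collects real updates from library APIs.)
DEPRECATED_APIS = {
    "np.alltrue": "np.all",       # removed in NumPy 2.0
    "df.append": "pd.concat",     # removed in pandas 2.0
}

def find_outdated_calls(source: str) -> list[tuple[int, str, str]]:
    """Return (line, old_api, suggested_api) for each flagged call."""
    results = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Match simple attribute calls of the form `name.attr(...)`.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            base = node.func.value
            if isinstance(base, ast.Name):
                dotted = f"{base.id}.{node.func.attr}"
                if dotted in DEPRECATED_APIS:
                    results.append((node.lineno, dotted, DEPRECATED_APIS[dotted]))
    return results

code = "import numpy as np\nok = np.alltrue([True, True])\n"
for line, old, new in find_outdated_calls(code):
    print(f"line {line}: {old} is outdated; consider {new}")
# → line 2: np.alltrue is outdated; consider np.all
```

A static scan like this can only flag syntactic matches; resolving aliases and call signatures reliably requires the richer update data that CODESYNC gathers.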
Why it matters?
This matters because as programming languages and libraries evolve, we need AI coding assistants that can keep up. CODESYNC and CODESYNCBENCH provide tools for researchers to develop better methods for updating AI models' knowledge in real-time. This could lead to more reliable and efficient AI coding assistants, helping programmers write better, safer code more quickly.
Abstract
Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.