GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
2025-07-17
Summary
This paper introduces GitChameleon, a new benchmark dataset designed to test how well AI models can generate Python code that works correctly with specific versions of software libraries.
What's the problem?
The problem is that software libraries change over time: functions get renamed, removed, or behave differently from one release to the next. AI models struggle to write code that matches the exact library version a project uses, which can cause errors or bugs in real-world programming.
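To make this concrete, here is a minimal illustrative example of the kind of version-specific break involved; NumPy is just one of many affected libraries, and this snippet is not drawn from the GitChameleon dataset itself.

```python
# Illustrative only: a version-specific API break in NumPy (not a GitChameleon task).
import numpy as np

x = [1.5, 2.5, 3.5]

# NumPy < 1.24: np.float was an accepted alias for Python's float.
# mean = np.float(np.mean(x))

# NumPy >= 1.24: the np.float alias was removed, so the line above raises
# AttributeError; the built-in float() (or np.float64) must be used instead.
mean = float(np.mean(x))
print(mean)
```

A model that memorized the older API would emit the commented-out line and fail at runtime on newer installations, which is exactly the kind of mistake that is hard to catch without running the code.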
What's the solution?
The authors curated 328 Python coding problems, each tied to a specific release of a library and paired with executable unit tests that check whether the generated code actually works under that version. They used this benchmark to evaluate a range of AI code generators and showed that even the best models have a hard time satisfying version-specific requirements correctly.
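The sketch below shows what a version-conditioned problem paired with an execution-based test might look like. The field names (library, version, task, starter_code) and the check and reference_add_row functions are hypothetical, chosen for illustration rather than taken from the dataset's actual schema.

```python
# Hypothetical sketch of a version-conditioned problem with an execution-based test.
problem = {
    "library": "pandas",
    "version": "2.1",  # the exact release the generated solution must target
    "task": "Append a single row (given as a dict) to a DataFrame and return the result.",
    "starter_code": "def add_row(df, row):\n    ...",
}

def check(candidate_add_row):
    """Hidden test: runs the candidate solution instead of string-matching it."""
    import pandas as pd
    df = pd.DataFrame({"a": [1, 2]})
    out = candidate_add_row(df, {"a": 3})
    assert list(out["a"]) == [1, 2, 3]

def reference_add_row(df, row):
    # Correct for pandas >= 2.0, where DataFrame.append was removed:
    # concatenate a one-row DataFrame instead.
    import pandas as pd
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)

check(reference_add_row)  # passes under pandas 2.x
```

A solution that called df.append(row, ignore_index=True) would pass under pandas 1.x but fail this test on 2.x, which is why executing the code against the pinned version gives a stricter signal than comparing it to a reference string.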
Why it matters?
This matters because generating code that works with the correct library version is essential for reliable software development, especially in professional environments where libraries cannot always be updated to their latest releases.
Abstract
GitChameleon is a dataset for evaluating version-conditioned code generation with execution-based tests, covering large language models, LLM-powered agents, code assistants, and RAG systems.