Diff-XYZ: A Benchmark for Evaluating Diff Understanding

Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov

2025-10-24

Summary

This paper introduces a new way to test how well large language models can understand and work with changes made to code, known as 'diffs'. It focuses on building a reliable benchmark to measure this ability.

What's the problem?

When developers change code, they create 'diffs', which show exactly what was added, removed, or modified. For tools that automatically edit code, such as AI-powered coding assistants, understanding these diffs is crucial. However, there was no good, standardized way to test how well these models handle diffs, even though the choice of diff representation can significantly affect performance. Existing evaluations were not comprehensive enough to truly probe the limits of these models.

What's the solution?

The researchers created a benchmark called 'Diff-XYZ', built from a large collection of real code changes. It has three tasks: apply (take old code and a diff to produce the new code), anti-apply (take new code and a diff to recover the old code), and diff generation (produce the diff itself given the old and new code). They also compared different ways of formatting these diffs to see which works best for different AI models and tasks, using code changes from real projects to keep the benchmark realistic.
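The three tasks can be illustrated with a small sketch. Here, Python's standard `difflib` produces a unified diff (the diff generation task), and two minimal helper functions, `apply_diff` and `anti_apply` (hypothetical names, not from the paper), replay that diff forward and backward. The appliers assume a diff whose hunks cover the whole file, which is enough to show the idea:

```python
import difflib

# Diff generation: given old code and new code, produce a unified diff.
old = ["def add(a, b):\n", "    return a + b\n"]
new = ["def add(a, b):\n", "    # sum two values\n", "    return a + b\n"]
diff = list(difflib.unified_diff(old, new, fromfile="old.py", tofile="new.py"))

def apply_diff(old_lines, diff_lines):
    """Apply: old code + diff -> new code (single-hunk sketch)."""
    result = []
    for line in diff_lines:
        if line.startswith(("---", "+++", "@@")):
            continue  # skip file headers and hunk markers
        if line.startswith("+"):
            result.append(line[1:])  # added line goes into the new code
        elif line.startswith("-"):
            continue  # removed line is dropped
        else:
            result.append(line[1:])  # context line is kept unchanged
    return result

def anti_apply(new_lines, diff_lines):
    """Anti-apply: new code - diff -> old code (the inverse direction)."""
    result = []
    for line in diff_lines:
        if line.startswith(("---", "+++", "@@")):
            continue
        if line.startswith("-"):
            result.append(line[1:])  # a removed line belongs to the old code
        elif line.startswith("+"):
            continue  # an added line did not exist in the old code
        else:
            result.append(line[1:])
    return result
```

Running `apply_diff(old, diff)` reconstructs the new code, and `anti_apply(new, diff)` recovers the old code, mirroring the benchmark's first two tasks.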

Why it matters?

This work is important because it provides a standard tool for evaluating and improving AI models that work with code. By understanding how well these models handle diffs, developers can build better automated code editing and refactoring tools. It also helps guide the development of new and improved diff formats that are more efficient for AI to process, ultimately leading to more powerful and reliable coding assistants.

Abstract

Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code + diff → new code), anti-apply (new code - diff → old code), and diff generation (new code - old code → diff). Instances in the benchmark are triples ⟨old code, new code, diff⟩ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format is good for larger models in the diff generation scenario, yet not well suited for diff analysis and smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models editing code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
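To make the cross-format comparison concrete, the sketch below renders the same one-line edit in two representations: the standard unified diff format (via Python's `difflib`) and a search-replace block. The search-replace layout shown here is an assumption for illustration; the paper's exact search-replace syntax may differ:

```python
import difflib

old = "def greet(name):\n    print('Hello ' + name)\n"
new = "def greet(name):\n    print(f'Hello {name}')\n"

# Representation 1: unified diff (standard format with hunk headers)
unified = "".join(difflib.unified_diff(
    old.splitlines(keepends=True), new.splitlines(keepends=True),
    fromfile="a/greet.py", tofile="b/greet.py"))

# Representation 2: search-replace block (layout is an assumption,
# loosely modeled on conflict-marker style; not the paper's exact syntax)
search_replace = (
    "<<<<<<< SEARCH\n"
    "    print('Hello ' + name)\n"
    "=======\n"
    "    print(f'Hello {name}')\n"
    ">>>>>>> REPLACE\n"
)
```

The unified diff encodes positions and context lines explicitly, while the search-replace form only names the text to find and its replacement, which is one reason the two formats can favor different model sizes and tasks.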