KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran
2025-03-06
Summary
This paper introduces KodCode, a new dataset created to help train AI models to write better computer code. It's like a huge collection of coding homework problems, complete with answers and ways to check whether those answers are correct.
What's the problem?
Current datasets used to teach AI about coding aren't very good. They either don't cover enough different types of coding problems or don't have a reliable way to check whether the code actually works. This makes it hard for AI to learn how to code properly.
What's the solution?
The researchers created KodCode, which uses AI to generate a wide range of coding questions, from easy to very hard. For each question, it also creates a correct answer and a set of tests to check whether the answer works. They used clever techniques to make sure the questions are diverse and the answers are correct, and they even had the AI rewrite questions in different ways to make the dataset more varied.
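The core idea of checking each generated answer against its generated tests can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: `self_verify` and the toy question-solution-test triplet below are placeholders standing in for LLM-generated content.

```python
import os
import subprocess
import sys
import tempfile

def self_verify(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run the generated unit tests against the generated solution.

    A question-solution-test triplet is kept only if the tests pass,
    mirroring the self-verification idea described above.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    finally:
        os.unlink(path)

# Toy triplet standing in for LLM output (not taken from the paper).
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(self_verify(solution, tests))  # True: this triplet would be kept
```

Running the tests in a subprocess (rather than in-process) keeps a crashing or infinite-looping candidate solution from taking down the whole pipeline; the `timeout` handles the looping case.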
Why does it matter?
This matters because it helps make AI better at writing code. When researchers used KodCode to train AI models, those models became better at coding tasks than some of the best existing AI coders. This could lead to AI that helps programmers write better code faster, or even teaches coding to students more effectively.
Abstract
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, we synthesize post-training data by rewriting questions into diverse formats and generating responses under a test-based reject-sampling procedure with a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also offer strong potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
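The test-based reject sampling step in the abstract can be illustrated with a short sketch. Everything here is hypothetical: `candidates` stands in for multiple responses sampled from a reasoning model such as DeepSeek R1, and verification is done with a plain `exec` for brevity (a real pipeline would sandbox execution).

```python
from typing import Optional

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Execute a candidate solution together with its unit tests."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

def reject_sample(candidates: list, test_code: str) -> Optional[str]:
    """Keep the first sampled response whose code passes the unit tests
    and discard the rest (test-based reject sampling)."""
    for code in candidates:
        if passes_tests(code, test_code):
            return code
    return None  # no candidate verified; the question could be re-attempted

# Hypothetical samples standing in for reasoning-model outputs.
candidates = [
    "def fib(n):\n    return n",  # wrong: fails for n >= 6
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a",  # correct iterative Fibonacci
]
tests = "assert fib(0) == 0\nassert fib(5) == 5\nassert fib(10) == 55"
print(reject_sample(candidates, tests) is candidates[1])  # True
```

The paired unit tests do double duty: here they filter supervised fine-tuning responses, and the same pass/fail signal can later serve as a reward for RL tuning.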