SelfCodeAlign: Self-Alignment for Code Generation

Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang

2024-11-01

Summary

This paper introduces SelfCodeAlign, a new method for improving code generation in large language models (LLMs) without requiring extensive human annotations or distillation from larger proprietary models.

What's the problem?

Many current methods for training code-generating models rely heavily on human-created examples, which can be expensive and time-consuming to gather. This makes it difficult to create efficient systems that can generate accurate code based on user instructions, especially when those instructions vary widely.

What's the solution?

SelfCodeAlign addresses this issue by using the same base model throughout the entire data-generation process. It first extracts diverse coding concepts from high-quality seed code snippets, then creates new coding tasks based on those concepts. The model samples multiple responses for each task, pairs each response with test cases, and executes them in a sandbox environment to see which ones actually work. Only the responses that pass their tests are kept for instruction tuning. This approach allows the model to learn effectively without relying on extensive human annotations.
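The core of that last step is execution-based filtering: run each sampled response against its paired tests and keep only the passing ones. Below is a minimal sketch of that idea. The names (`candidates`, `test_code`, `filter_passing`) are illustrative, not from the paper, and a real pipeline would sample candidates from the base model and execute them in an isolated sandbox rather than the local interpreter.

```python
def passes_tests(solution: str, test_code: str) -> bool:
    """Execute a candidate solution plus its paired test cases;
    return True only if no exception or assertion failure occurs."""
    env: dict = {}
    try:
        exec(solution, env)   # define the candidate function
        exec(test_code, env)  # run the paired test cases against it
        return True
    except Exception:
        return False

def filter_passing(candidates: list[str], test_code: str) -> list[str]:
    """Keep only the responses whose tests pass; these become
    the instruction-tuning examples."""
    return [sol for sol in candidates if passes_tests(sol, test_code)]

# Toy task with two sampled responses: one correct, one buggy.
candidates = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # buggy
]
test_code = "assert add(2, 3) == 5"

kept = filter_passing(candidates, test_code)  # only the correct response survives
```

Because failing responses are simply discarded, the quality of the resulting training set depends on how well the generated test cases discriminate correct from incorrect code.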

Why it matters?

This research is significant because it demonstrates that LLMs can improve their coding abilities by learning from their own generated data rather than relying on human-created examples. This could lead to more efficient and scalable methods for developing coding assistants and tools, making it easier for programmers to get help with their work.

Abstract

Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.
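The abstract reports results as pass@1 on HumanEval+. For context, pass@k is the standard execution-based metric for code generation: given n sampled solutions per problem of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples is correct. A sketch of the usual unbiased estimator, pass@k = 1 − C(n−c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the tests."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem and 4 correct ones, pass@1 reduces to c/n:
score = pass_at_k(n=10, c=4, k=1)  # 0.4
```

A benchmark score like 67.1 pass@1 is this quantity averaged over all problems in the benchmark, expressed as a percentage.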