InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen

2024-07-09

Summary

This paper introduces InverseCoder, a new approach to improving code-generating AI models through a technique called Inverse-Instruct. The method lets a model create better instructions for coding tasks by generating training data from itself instead of relying on other, closed-source models.

What's the problem?

The main problem is that while existing large language models (LLMs) can generate code well, they are typically fine-tuned on data produced by powerful but closed-source models like GPT-3.5 and GPT-4, which makes them dependent on proprietary systems. The authors also point out an asymmetry between formal coding language and informal natural language: it is easier for these models to translate code into plain language than to do the opposite.

What's the solution?

To solve this issue, the authors propose a method called Inverse-Instruct, which has the model summarize code snippets into natural-language instructions, the easier translation direction. Starting from an existing instruction-tuning corpus, they ask the model to generate new, high-quality instructions for the code it contains and to filter them through self-evaluation. They then combine the original instructions with the newly generated ones and fine-tune the base model on this enlarged dataset. The result is a series of models named InverseCoder that outperform the original code LLMs on a variety of coding tasks.

Why it matters?

This research is important because it enhances how AI can assist with coding tasks, making it more effective and versatile. By improving the instruction-tuning process, InverseCoder can help developers write code more efficiently across different programming languages and applications, ultimately making software development easier and more accessible.

Abstract

Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on the data generated from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction tuning. This paper explores how to further improve an instruction-tuned code LLM by generating data from itself rather than querying closed-source LLMs. Our key observation is the misalignment between the translation of formal and informal languages: translating formal language (i.e., code) to informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse. Specifically, given an instruction tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. Then, we fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.