HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

2024-12-31

Summary

This paper introduces HumanEval Pro and MBPP Pro, new benchmarks designed to evaluate how well large language models (LLMs) can generate code by first solving a simpler base problem and then using that solution to tackle a related, more complex one.

What's the problem?

While LLMs have shown great success in generating code for straightforward tasks, they often struggle when faced with more complicated problems that require them to build on their previous solutions. This limitation highlights the need for better evaluation methods that can assess the reasoning and problem-solving abilities of these models in a more challenging context.

What's the solution?

To address this issue, the authors introduce self-invoking code generation, a task in which a model first solves a base problem and then uses that solution to solve a related, more complex problem. They build three new benchmarks—HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro—that specifically test this capability. The authors found that most LLMs perform well on traditional benchmarks but drop noticeably on the self-invoking versions (for example, o1-mini falls from 96.2% pass@1 on HumanEval to 76.2% on HumanEval Pro), indicating a gap in their ability to compose solutions. A hypothetical example of such a problem pair is sketched below.
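To make the task concrete, here is a hypothetical Python example in the spirit of these benchmarks (the function names and problems are illustrative, not taken from HumanEval Pro itself): a base problem plus a harder problem whose reference solution invokes the base solution.

# Base problem (HumanEval-style): keep only the even numbers in a list.
def filter_even(numbers):
    """Return the even values of `numbers`, preserving order."""
    return [n for n in numbers if n % 2 == 0]

# Self-invoking problem: a harder task whose natural solution
# reuses (invokes) the base solution above.
def sum_even_per_row(rows):
    """For each row, return the sum of its even values."""
    return [sum(filter_even(row)) for row in rows]

# A model is scored on producing both functions so that the second
# one correctly builds on the first.
assert filter_even([1, 2, 3, 4]) == [2, 4]
assert sum_even_per_row([[1, 2, 3], [4, 5, 6]]) == [2, 10]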

Why it matters?

This research is important because it sheds light on the limitations of current LLMs in understanding and generating code. By developing these new benchmarks, the authors provide valuable insights into how these models can be improved. This work can guide future research efforts aimed at enhancing the coding abilities of LLMs, making them more effective tools for developers and programmers.

Abstract

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On self-invoking code generation tasks, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.
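The pass@1 scores quoted above are typically computed with the standard unbiased pass@k estimator used for HumanEval-style benchmarks; the sketch below assumes that convention (the paper's exact evaluation harness may differ).

from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples drawn per problem, c of them passed the tests.

    Estimates the probability that at least one of k randomly chosen
    samples is correct; for k = 1 this reduces to c / n.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 7 of them passing, evaluated at k = 1.
print(pass_at_k(n=10, c=7, k=1))  # 0.7

Benchmark-level pass@1 is then the average of this quantity over all problems.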