Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

2024-10-30

Summary

This paper introduces REPOCOD, a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate code in real-world scenarios, and concludes that LLMs cannot yet fully replace human programmers.

What's the problem?

While LLMs have shown high accuracy in generating simple code, existing benchmarks do not reflect the complexity of real-world software development. Most current benchmarks consist of short, self-contained problems and do not capture the challenges programmers face on actual projects, such as needing context from across an entire repository, making it hard to determine whether LLMs can truly replace human coders.

What's the solution?

To address this issue, the authors created REPOCOD, a benchmark of 980 coding problems drawn from 11 popular real-world projects. More than half of the problems require file-level or repository-level context, so models must understand the surrounding project rather than just an isolated code snippet. In an evaluation of ten LLMs, none achieved more than 30% pass@1 on these tasks, indicating that current models still struggle with the complexity of real programming work; a sketch of how such an evaluation might be run appears below.
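For concreteness, here is a minimal sketch of a repository-level pass@1 evaluation loop. It assumes a hypothetical problem format (target file, canonical solution, and a per-problem test command) and a hypothetical `generate_completion` callable; the authors' actual harness may differ, and a real setup would also need sandboxing and per-project environment management.

```python
import subprocess
from pathlib import Path

def evaluate_pass_at_1(problems, generate_completion):
    """Rough sketch of a repository-level pass@1 evaluation.

    `problems` is assumed to be a list of dicts holding the repository path,
    the target file, the canonical solution to replace, a prompt, and the
    test command that exercises the target function. `generate_completion`
    is a hypothetical callable returning the model's candidate code.
    """
    passed = 0
    for problem in problems:
        target = Path(problem["repo_path"]) / problem["target_file"]
        original = target.read_text()

        # Splice the model's candidate in place of the ground-truth function.
        candidate = generate_completion(problem["prompt"])
        patched = original.replace(problem["canonical_solution"], candidate)
        target.write_text(patched)

        # Run the project's own tests for that function; success = all pass.
        result = subprocess.run(
            problem["test_command"].split(),
            cwd=problem["repo_path"],
            capture_output=True,
            timeout=600,
        )
        if result.returncode == 0:
            passed += 1

        # Restore the repository before moving on to the next problem.
        target.write_text(original)

    return passed / len(problems)
```

Under this sketch, a problem counts as solved only if the tests exit successfully, so pass@1 is simply the fraction of problems whose single generated candidate survives the project's test suite.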

Why it matters?

This research is important because it highlights the limitations of current LLMs in software development. By establishing a more realistic benchmark, REPOCOD sets a standard for future improvements in AI coding tools and underscores the need for stronger models that can better assist human programmers.

Abstract

Large language models (LLMs) have shown remarkable ability in code generation, with more than 90% pass@1 on the Python coding problems in HumanEval and MBPP. Such high accuracy leads to the question: can LLMs replace human programmers? Existing manually crafted, simple, or single-line code generation benchmarks cannot answer this question due to their gap with real-world software development. To answer it, we propose REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. In our evaluations of ten LLMs, none of the models achieves more than 30% pass@1 on REPOCOD, revealing the need for stronger LLMs that can help developers in real-world software development.
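To give intuition for the reported average cyclomatic complexity of 9.00 per canonical solution, the sketch below approximates McCabe's measure for Python code as one plus the number of decision points found in the AST. This is an illustrative approximation, not the tool the authors used; production tools such as radon handle more constructs and edge cases.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe cyclomatic complexity: 1 + number of decision points.

    Counts branching constructs (if/for/while, ternaries, exception handlers),
    boolean operators, and comprehension filters. Treat this as a rough
    estimate rather than an exact reproduction of any particular tool.
    """
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp,
                             ast.ExceptHandler)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1  # each and/or adds a path
        elif isinstance(node, ast.comprehension):
            complexity += 1 + len(node.ifs)     # the loop plus any filters
    return complexity


# Example: a small function with a few branches.
snippet = """
def clamp(values, lo, hi):
    out = []
    for v in values:
        if v < lo:
            out.append(lo)
        elif v > hi:
            out.append(hi)
        else:
            out.append(v)
    return out
"""
print(cyclomatic_complexity(snippet))  # 4: the for loop and the if/elif chain
```

By this rough measure, an average complexity of 9.00 corresponds to canonical solutions with many more branch points than the toy example above, which is part of what distinguishes REPOCOD from simpler single-function benchmarks.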