CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
2025-02-25
Summary
This paper introduces CodeCriticBench, a new benchmark for testing how well AI language models can understand and critique computer code, covering both writing code and answering questions about it.
What's the problem?
Current methods for testing an AI's ability to critique code are too simple and don't cover enough different types of coding tasks. They mostly focus on general reasoning skills or just code generation, and the questions they use are often too easy. These tests also fail to assess the many different dimensions along which an AI should be able to understand and critique code.
What's the solution?
The researchers created CodeCriticBench, which tests AI on two main coding tasks: writing code and answering questions about code. These tasks come in different difficulty levels to genuinely challenge the AI. The researchers also designed detailed checklists to evaluate how well the AI critiques code from many different angles, not just whether the code is correct. They then tested a range of existing AI models on CodeCriticBench to demonstrate its effectiveness.
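The two evaluation modes described above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual scoring code: the function names, the binary "correct/incorrect" labels, and the 0-to-1 checklist scores are all assumptions made for the example.

```python
# Hypothetical sketch of the two evaluation modes: "basic" checks whether
# the model's overall verdict on each code sample is right, while
# "advanced" averages fine-grained checklist scores. All names and
# score formats are illustrative, not CodeCriticBench's real schema.

def basic_score(predictions, labels):
    """Basic critique evaluation: fraction of samples where the model's
    correct/incorrect verdict matches the ground-truth label."""
    hits = sum(p == l for p, l in zip(predictions, labels))
    return hits / len(labels)

def advanced_score(checklist_results):
    """Advanced critique evaluation: average over fine-grained checklist
    items (e.g., error analysis, style, efficiency), each graded 0-1."""
    return sum(checklist_results) / len(checklist_results)

# The model judged two code samples; only the first verdict was right.
print(basic_score(["correct", "incorrect"], ["correct", "correct"]))  # 0.5

# Four checklist items graded for one critique.
print(advanced_score([1.0, 0.5, 0.0, 1.0]))  # 0.625
```

The design choice here mirrors the paper's split: a coarse accuracy number is easy to compare across models, while the checklist average surfaces *where* a critique falls short rather than just whether the final judgment was right.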
Why it matters?
This matters because as AI gets better at working with code, we need good ways to test how well it can actually understand and improve code. CodeCriticBench could help researchers build better AI that assists programmers more effectively, potentially leading to faster and more reliable software development. It could also help companies choose the right AI tools for code review and programming assistance.
Abstract
The critique capacity of Large Language Models (LLMs) is essential for their reasoning abilities, as it can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), and the difficulty of their queries is relatively low (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, where fine-grained evaluation checklists are well designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, whose results show the effectiveness of CodeCriticBench.