BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton

2025-10-13

Summary

This paper introduces BigCodeArena, a new platform for evaluating how well large language models (LLMs) can write code, and uses it to create benchmarks for comparing different models.

What's the problem?

Evaluating code generated by LLMs is really hard for humans because you need to understand complex code and actually *run* it to see if it works. Existing platforms for evaluating LLMs don't really address this specific challenge for code, making it difficult to accurately assess which models are best at coding tasks.

What's the solution?

The researchers built BigCodeArena, which is like Chatbot Arena but specifically for code. It lets people compare code generated by different LLMs and, importantly, it can actually *execute* that code and show the results to the human evaluator. They collected data from over 14,000 coding conversations and used that data to create two benchmarks: BigCodeReward, which checks how well automated judges can predict human preferences for code, and AutoCodeArena, which automatically ranks models on their coding performance without needing human input.
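To give a feel for how an automatic ranking like AutoCodeArena can be built from pairwise preferences, here is a minimal sketch of standard Elo rating updates. The model names, starting rating, and K-factor below are illustrative assumptions, not the paper's actual configuration:

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    # Shift ratings toward the observed outcome (winner scored 1, loser 0)
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_win)
    ratings[loser] -= k * (1 - e_win)

# Hypothetical pairwise outcomes, each recorded as (winner, loser)
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # everyone starts at the same rating
for w, l in battles:
    update_elo(ratings, w, l)

ranking = sorted(ratings.items(), key=lambda kv: -kv[1])
print(ranking)  # model_a ends up ranked first
```

In practice, arena-style leaderboards often replace this sequential update with an order-independent fit (e.g., a Bradley-Terry model), but the idea of turning pairwise wins into a single rating per model is the same.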

Why it matters?

This work is important because it provides a more reliable way to evaluate LLMs for coding. By actually running the code and gathering human feedback, they can identify which models are truly good at generating functional, useful code, and they've created tools that can help continue improving these models in the future. The findings also show that proprietary models like GPT-5 still lead in code generation among recently released models, though there is still room for improvement in specialized areas.

Abstract

Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.