SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
2024-08-27

Summary
This paper introduces SWE-bench-java, a benchmark designed to evaluate how well large language models (LLMs) can resolve GitHub issues, specifically in Java projects.
What's the problem?
Resolving GitHub issues is an important task in software development, but the existing benchmark, SWE-bench, covers only Python. This limits the ability to assess LLMs on other popular programming languages such as Java, which is widely used in industry.
What's the solution?
The authors created SWE-bench-java as a Java counterpart of the SWE-bench benchmark. They released the dataset together with a Docker-based evaluation environment and a leaderboard for testing how well LLMs handle real Java issues. To verify the reliability of the new benchmark, they also implemented the classic SWE-agent method and ran several powerful LLMs on it, as illustrated by the sketch below.
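Since SWE-bench-java follows the SWE-bench recipe, each instance pairs a real GitHub issue with the repository it was filed against and the gold patch that resolved it. The minimal sketch below shows how one might load and inspect such a dataset with the Hugging Face datasets library; the dataset identifier is a placeholder, and the field names are assumed to mirror SWE-bench's schema rather than confirmed from the paper.

```python
# Minimal sketch: inspect SWE-bench-java-style instances with the
# Hugging Face `datasets` library. The dataset name is a placeholder,
# and the field names are assumed to follow SWE-bench's schema.
from datasets import load_dataset

ds = load_dataset("org-name/swe-bench-java", split="test")  # hypothetical identifier

for example in ds.select(range(3)):
    print(example["instance_id"])        # issue identifier (assumed field name)
    print(example["repo"])               # Java repository the issue was filed against
    print(example["problem_statement"])  # the GitHub issue text the model must resolve
    print(example["patch"][:200])        # gold patch that fixed the issue (truncated)
```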
Why it matters?
This research matters because it extends automated GitHub issue resolution beyond Python to Java, making it easier for developers to apply AI tools across different programming languages. Better multilingual benchmarks, in turn, enable better automated solutions for real-world coding problems.
Abstract
GitHub issue resolving is a critical task in software engineering that has recently gained significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs), but so far it has only covered Python. However, supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. As is well known, developing a high-quality multilingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
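Evaluation in this setup mirrors SWE-bench: a model (or an agent such as SWE-agent) produces a candidate patch for each instance, and the Docker-based environment applies the patch inside the project's container and reruns the repository's tests. The sketch below writes predictions in the JSON format used by SWE-bench-style harnesses; the exact keys and the harness invocation for SWE-bench-java are assumptions, not taken from the paper.

```python
# Minimal sketch of a SWE-bench-style predictions file. The keys follow
# the SWE-bench convention (instance_id / model_name_or_path / model_patch);
# whether SWE-bench-java uses exactly these keys is an assumption.
import json

predictions = [
    {
        "instance_id": "example__project-123",  # hypothetical instance id
        "model_name_or_path": "my-model",       # label shown on the leaderboard
        "model_patch": "diff --git a/src/Main.java b/src/Main.java\n...",  # unified diff proposed by the model
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```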