Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection

Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, David Lo

2025-03-06

Summary

This paper tests how well different AI language models can find security problems in code written in multiple programming languages.

What's the problem?

While AI language models are being used more and more in software development, we don't know enough about how good they are at finding security vulnerabilities in code, especially across different programming languages. Most studies only look at one or two ways of using these models and focus on C/C++ code.

What's the solution?

The researchers created a large dataset with over 44,000 examples of vulnerable code in Python, Java, and JavaScript. They tested five open-source AI models using different methods, like giving them special instructions (prompting) or training them further on the detection task. They compared these models to smaller, specialized models and traditional code-checking tools. They also tried two ways to make the AI models better: balancing the training data and combining predictions from multiple models.
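The "balancing the training data" idea can be sketched as downsampling the majority class so vulnerable and non-vulnerable examples are equally represented before retraining. This is an illustrative sketch, not the paper's exact pipeline; the function name and data layout are assumptions.

```python
import random

def downsample_balance(samples, seed=0):
    """Downsample the majority class so both classes end up the same size.

    `samples` is a list of (code, label) pairs, where label 1 means
    vulnerable and 0 means not vulnerable. This mirrors the general idea
    of retraining on a downsampled balanced dataset; the names and data
    format here are hypothetical.
    """
    rng = random.Random(seed)
    vulnerable = [s for s in samples if s[1] == 1]
    benign = [s for s in samples if s[1] == 0]
    # Identify which class is smaller, then randomly keep only that many
    # examples from the larger class.
    minority, majority = sorted([vulnerable, benign], key=len)
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced
```

In practice, vulnerability datasets are heavily skewed toward non-vulnerable code, so without some form of balancing a model can score well simply by predicting "not vulnerable" for everything.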

Why it matters?

This matters because finding security problems in code is crucial for keeping software safe. By understanding how well AI can do this job, we can improve our tools for catching vulnerabilities before they become real threats. This could lead to safer software and fewer cyber attacks in the future.

Abstract

Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study examining the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research primarily focuses on evaluating LLMs using C/C++ datasets. It typically explores only one or two strategies among prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this knowledge gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We have compiled a comprehensive dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data perspective: Retraining models using downsampled balanced datasets. b) Model perspective: Investigating ensemble learning methods that combine predictions from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future advancements in leveraging generative AI to enhance software security practices.
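The "model perspective" avenue the abstract mentions, combining predictions from multiple LLMs, can be illustrated with the simplest ensemble scheme: majority voting over binary vulnerable/not-vulnerable labels. This is a minimal sketch of one possible combination rule; the paper's actual ensemble methods may differ, and the tie-breaking choice here is an assumption.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine binary predictions (1 = vulnerable, 0 = not vulnerable)
    from several models by majority vote.

    Ties default to 'vulnerable' on the assumption that a false alarm is
    cheaper than a missed vulnerability; that policy is illustrative, not
    taken from the paper.
    """
    counts = Counter(predictions)
    return 1 if counts[1] >= counts[0] else 0

# Example: three models disagree on one function.
verdict = majority_vote([1, 1, 0])  # two of three models flag it
```

A single weak model's mistakes can be washed out when the other models disagree with it, which is the intuition behind ensembling; whether that actually helps for vulnerability detection is one of the questions the paper investigates empirically.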