A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang
2025-09-01
Summary
This paper introduces a way to evaluate how well large language models (LLMs) generate secure code, specifically when they are asked to fix security flaws in existing software projects.
What's the problem?
Currently, there aren't good ways to test the security of code written by LLMs. Existing benchmarks usually look at small, isolated snippets of code, and their results aren't always consistent or reproducible. They also ignore how much context about the overall project the LLM is given, even though that context can affect the security of the generated code. As a result, it's hard to know whether an LLM is *actually* fixing a security problem or just making changes that look right.
What's the solution?
The researchers created a new benchmark called A.S.E (AI Code Generation Security Evaluation). It builds its tasks from real software projects with known, documented security vulnerabilities (CVEs). Each task gives the LLM the entire repository to work with, including all the files and build instructions, just as a real developer would have. The benchmark also includes a containerized system that automatically and consistently checks whether the LLM's changes actually fix the security problem, build correctly, and don't introduce new issues. The researchers then used it to test several leading LLMs.
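To make the checking step concrete, here is a minimal sketch of what a rule-based security check could look like. This is an illustration only, not the paper's actual framework: the real benchmark runs builds and expert-defined checks inside containers, while here a "rule" is simply a hypothetical regex (with made-up CVE IDs) that must not match the patched code.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    """Hypothetical expert-defined rule: a pattern a secure patch must NOT contain."""
    cve_id: str
    insecure_pattern: str

def evaluate_patch(patched_source: str, rules: list[Rule]) -> dict:
    """Return per-rule pass/fail and an overall security verdict."""
    per_rule = {
        r.cve_id: re.search(r.insecure_pattern, patched_source) is None
        for r in rules
    }
    return {"per_rule": per_rule, "secure": all(per_rule.values())}

# Illustrative rules with placeholder CVE IDs.
rules = [
    Rule("CVE-XXXX-0001", r"strcpy\s*\("),           # unbounded string copy
    Rule("CVE-XXXX-0002", r"system\s*\(\s*user_"),   # command built from user input
]

patched = "strncpy(dst, src, sizeof(dst) - 1);\n"
report = evaluate_patch(patched, rules)
print(report["secure"])  # → True: neither insecure pattern remains
```

In the actual benchmark this kind of verdict would be combined with a containerized build check and a regression check, so a patch only counts as successful if it is secure, compiles, and does not break existing behavior.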
Why it matters?
This work is important because as we start using LLMs more and more in software development, we need to be sure the code they create is secure. A.S.E. provides a more realistic and reliable way to test LLMs, helping developers understand which models are best at fixing security vulnerabilities and how to use them safely. The findings also suggest that simpler approaches to code generation can sometimes be more secure than complex reasoning strategies.
Abstract
The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.