SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models
Scott Thornton
2025-12-23
Summary
This paper introduces SecureCode v2.0, a dataset designed to improve the security of code produced with AI assistants, which frequently generate code containing security flaws.
What's the problem?
AI assistants are increasingly used to write code, but they frequently generate code containing security vulnerabilities that can reach production systems. Existing datasets for training these assistants to write secure code are inadequate: they lack connections to actual security incidents, are too small for modern AI training, and offer little practical guidance for developers deploying code securely in production.
What's the solution?
The researchers created SecureCode v2.0, a dataset of over 1,200 coding examples covering 11 programming languages and 11 common vulnerability types. Each example pairs a vulnerable implementation with a secure one, explains how the vulnerability could be exploited, and provides detailed defensive advice, including integration with security monitoring systems and infrastructure hardening. Each example is structured as a conversation between a developer and an AI assistant, starting with a simple implementation and gradually adding security considerations.
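As a rough illustration of this structure, a single record might look like the sketch below. All field names, the CVE placeholder, and the turn contents are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of one SecureCode v2.0 record. Field names, the CVE
# placeholder, and the turn contents are illustrative, not the real schema.
example = {
    "cve": "CVE-0000-0000",            # placeholder for the documented incident
    "language": "python",
    "vulnerability": "injection",       # one of the 11 vulnerability categories
    "conversation": [                   # 4-turn developer/assistant exchange
        {"role": "developer", "content": "Write a lookup of users by name."},
        {"role": "assistant", "content": "(basic, vulnerable implementation) ..."},
        {"role": "developer", "content": "Is this safe against SQL injection?"},
        {"role": "assistant", "content": "No; use parameterized queries ..."},
    ],
    "exploit": "An attacker supplies a crafted name value to dump the table.",
    "defense_in_depth": {
        "siem": "Alert on bursts of SQL syntax errors from a single client.",
        "infrastructure": "Restrict the service with an AppArmor profile and a WAF rule.",
    },
}

# The escalating 4-turn shape: basic implementation first, security
# considerations and defense-in-depth guidance in the later turns.
assert [t["role"] for t in example["conversation"]] == [
    "developer", "assistant", "developer", "assistant",
]
```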
Why does it matter?
This work is important because it provides a much-needed resource for training AI assistants to write more secure code. By grounding the examples in real-world security incidents and offering practical operational guidance, it helps bridge the gap between theoretical security knowledge and the practical challenges of building and deploying secure software, ultimately making systems less vulnerable to attacks.
Abstract
AI assistants produce vulnerable code in 45% of security-relevant scenarios, introducing flaws into production systems at scale. Yet existing secure coding datasets fall short. They lack incident grounding, don't provide the scale modern training requires, and miss the operational security context developers need for production deployments. We present SecureCode v2.0, a production-grade dataset of 1,215 security-focused coding examples that passed structural validation and expert security review. Every example ties to actual documented security incidents with CVE references, provides vulnerable and secure implementations, demonstrates concrete attacks, and includes defense-in-depth operational guidance. The dataset covers 11 vulnerability categories (the complete OWASP Top 10:2025 plus AI/ML Security Threats) across 11 languages (Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, and YAML for infrastructure-as-code). Our quality assurance framework ensures complete incident grounding. Each example includes SIEM integration strategies, infrastructure hardening recommendations (Docker, AppArmor, WAF configurations), and testing approaches using language-appropriate frameworks. The dataset uses a 4-turn conversational structure mirroring actual developer-AI interactions, escalating from basic implementations to advanced security considerations and defense-in-depth guidance. Our contributions: (1) 1,215 rigorously validated examples split into 989 training, 122 validation, and 104 test examples, (2) an automated validation framework ensuring dataset consistency, (3) a 4-turn conversational structure capturing realistic security workflows, (4) comprehensive operational security guidance with SIEM integration strategies, (5) complete language-specific implementation fidelity, and (6) open-source release of data, validation tools, and benchmarking protocols.
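The paper does not spell out the validation framework's internals here; as a hedged sketch, a structural check over records with the properties the abstract describes (CVE grounding, a supported language, a 4-turn conversation) might look like the following. Field names and rules are assumptions for illustration.

```python
# Illustrative sketch of a structural validation pass; field names and rules
# are assumptions, not the paper's actual validation framework.
REQUIRED_FIELDS = {"cve", "language", "vulnerability", "conversation"}
LANGUAGES = {"python", "javascript", "java", "go", "php", "csharp",
             "typescript", "ruby", "rust", "kotlin", "yaml"}

def validate(record: dict) -> list[str]:
    """Return a list of structural problems found in one dataset record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("language") not in LANGUAGES:
        problems.append(f"unsupported language: {record.get('language')!r}")
    if len(record.get("conversation", [])) != 4:
        problems.append("conversation must have exactly 4 turns")
    if not str(record.get("cve", "")).startswith("CVE-"):
        problems.append("missing CVE reference for incident grounding")
    return problems

# Sanity check: the reported splits sum to the validated total.
assert 989 + 122 + 104 == 1215
```

A well-formed record yields an empty problem list; a record missing fields or with a wrong turn count accumulates one message per violation, which a batch validator could aggregate across the whole dataset.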