I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution
Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, Norbert Tihanyi
2025-06-24
Summary
This paper introduces CodeT5-Authorship, a model that identifies which large language model (LLM) generated a given C program, and reports high accuracy on that task.
What's the problem?
As AI-generated code becomes more common, it is important to be able to tell which AI model created a given code snippet, both to improve security and accountability and to prevent misuse. The task is difficult because different models can produce very similar code.
What's the solution?
The researchers built on CodeT5, an existing transformer architecture, and adapted it to classify code authorship. They created a large benchmark dataset of tens of thousands of C programs generated by eight different LLMs, used it to train and evaluate their model, and compared the model with other machine learning and transformer-based methods. CodeT5-Authorship performed much better than these baselines.
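To make the task concrete, here is a toy sketch of authorship attribution framed as multi-class classification. It is not CodeT5-Authorship and not any baseline from the paper. It is a simple character n-gram profile classifier, a classic stylometric technique, written with the Python standard library only. The labels `model_a` and `model_b`, the two training snippets, and all function names are hypothetical and chosen purely for illustration.

```python
from collections import Counter
import math


def char_ngrams(code: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (captures indentation, brace and comment style)."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one summed n-gram profile per author label from (code, label) pairs."""
    profiles = {}
    for code, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(code))
    return profiles


def predict(profiles, code: str) -> str:
    """Return the label whose profile is most similar to the query code."""
    feats = char_ngrams(code)
    return max(profiles, key=lambda label: cosine(profiles[label], feats))


# Hypothetical training data: two invented "models" with different formatting habits.
TRAIN = [
    ("int main(void)\n{\n\tint total = 0;\n\t/* sum values */\n\treturn total;\n}\n", "model_a"),
    ("int main(void) {\n    int total = 0;\n    // sum values\n    return total;\n}\n", "model_b"),
]

profiles = train(TRAIN)

# Unseen snippets written in each of the two invented styles.
query_a = "int add(int x, int y)\n{\n\t/* add */\n\treturn x + y;\n}\n"
query_b = "int sq(int v) {\n    // square\n    return v * v;\n}\n"

print(predict(profiles, query_a))
print(predict(profiles, query_b))
```

A real attribution system would train on many programs per LLM and use far richer representations than surface formatting. The sketch only shows the overall shape of the problem: code in, one of a fixed set of author labels out.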
Why does it matter?
This matters because accurately identifying the source of AI-generated code helps improve trust and safety in software development: it supports vulnerability detection, helps prevent plagiarism, and lets users know where their code comes from.
Abstract
A novel model, CodeT5-Authorship, is introduced to classify the authorship of C programs generated by Large Language Models, achieving higher accuracy than traditional and transformer-based classifiers.