I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution

Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, Norbert Tihanyi

2025-06-24

Summary

This paper introduces CodeT5-Authorship, a new model that identifies, with very high accuracy, which large language model (LLM) wrote a given piece of C code.

What's the problem?

As AI-generated code becomes more common, it is increasingly important to tell which AI model produced a given code snippet, for the sake of security, accountability, and misuse prevention. The task is difficult because different models can produce very similar code.

What's the solution?

The researchers built on CodeT5, an existing transformer architecture, and adapted it to classify code authorship. They created a large benchmark dataset of tens of thousands of C programs generated by eight different LLMs and used it to train and evaluate the model. They then compared the model against other machine learning and transformer-based methods, and it performed considerably better.
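To give a feel for what a "style signal" in code can look like, here is a minimal sketch of a hand-crafted stylometry baseline. It is not the method from the paper, which uses a transformer: the feature set, the function names `style_features` and `nearest_author`, and the nearest-centroid rule are all assumptions chosen for illustration.

```python
import re


def style_features(c_source: str) -> dict:
    """Compute a few hypothetical, hand-picked style features of a C snippet."""
    non_empty = [ln for ln in c_source.splitlines() if ln.strip()]
    n = max(len(non_empty), 1)
    # Crude tokenization: identifiers, keywords, and words inside comments alike.
    words = re.findall(r"[A-Za-z_]\w*", c_source)
    w = max(len(words), 1)
    return {
        # Lines that start like a comment (crude: "*p = 1;" would also match).
        "comment_line_ratio": sum(
            ln.strip().startswith(("//", "/*", "*")) for ln in non_empty
        ) / n,
        # Lines indented with a tab rather than spaces.
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in non_empty) / n,
        # Opening braces placed on their own line.
        "brace_on_own_line_ratio": sum(ln.strip() == "{" for ln in non_empty) / n,
        "avg_word_length": sum(map(len, words)) / w,
        "snake_case_ratio": sum("_" in word for word in words) / w,
    }


def nearest_author(features: dict, centroids: dict) -> str:
    """Return the label whose average feature vector is closest (squared Euclidean)."""
    def dist(a: dict, b: dict) -> float:
        return sum((a[k] - b[k]) ** 2 for k in a)
    return min(centroids, key=lambda name: dist(features, centroids[name]))
```

In use, `centroids` would map each candidate LLM to the average feature vector of its known programs. A learned model such as the one described above replaces these hand-picked features with representations learned from the code itself.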

Why does it matter?

This matters because accurately identifying the source of AI-generated code helps improve trust and safety in software development by detecting vulnerabilities, preventing plagiarism, and ensuring users know where their code comes from.

Abstract

A novel model, CodeT5-Authorship, is introduced to classify the authorship of C programs generated by Large Language Models, achieving higher accuracy than traditional and transformer-based classifiers.
