Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
2025-03-11
Summary
This paper uses Sparse Autoencoders, a tool for interpreting neural networks, to spot differences between human writing and AI-generated text by examining hidden patterns inside a language model as it processes text.
What's the problem?
It’s hard to tell if text was written by a human or AI because modern language models mimic humans well, and current detection tools struggle to explain how they make decisions or work across different writing styles.
What's the solution?
Researchers used Sparse Autoencoders to analyze the internal activations (the residual stream) of the Gemma-2-2b language model, uncovering specific features that separate human writing from AI text, although models can still produce human-like output when given personalized prompts.
Why it matters?
This helps build better tools to detect AI-generated content, which is crucial for preventing cheating, fake news, and plagiarism while making detection systems more transparent and trustworthy.
Abstract
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
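To make the idea of extracting SAE features from a residual stream concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The ReLU encoder, the mean pooling over tokens, the toy dimensions, and the random weights are all assumptions made for illustration; a real setup would load a pretrained SAE and actual Gemma-2-2b activations, and that SAE may use a different activation function.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def sae_encode(activation, w_enc, b_enc):
    """Map one residual-stream vector to non-negative SAE feature activations."""
    return [
        relu(sum(a * w for a, w in zip(activation, row)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

def sae_decode(features, w_dec, b_dec):
    """Reconstruct a residual-stream vector from SAE feature activations."""
    return [
        sum(f * w_dec[j][i] for j, f in enumerate(features)) + b_dec[i]
        for i in range(len(b_dec))
    ]

def text_features(token_activations, w_enc, b_enc):
    """Average feature activations over tokens (assumed pooling) for one text."""
    per_token = [sae_encode(a, w_enc, b_enc) for a in token_activations]
    return [sum(col) / len(per_token) for col in zip(*per_token)]

# Toy stand-ins: random weights and activations, NOT real Gemma-2-2b data.
rng = random.Random(0)
d_model, d_sae, n_tokens = 4, 8, 5
w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
b_enc = [-0.5] * d_sae  # a negative bias pushes more features to exactly zero
w_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
b_dec = [0.0] * d_model
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]

feats = text_features(tokens, w_enc, b_enc)
recon = sae_decode(sae_encode(tokens[0], w_enc, b_enc), w_dec, b_dec)
```

A per-text vector such as `feats` is the kind of representation on which a simple classifier or per-domain statistics could then be computed to find features that separate human-written from generated text.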