Detection Avoidance Techniques for Large Language Models

Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek

2025-03-11

Summary

This paper examines how people can trick AI detectors that spot machine-written text, showing that small changes to how a model writes, or to its finished output, can fool a detector into labeling the text as human-written.

What's the problem?

Detectors built to catch AI-generated content, including automated fake news, can be fooled simply by tweaking how the AI writes or by rewriting its output afterward, which makes it harder to trust information found online.

What's the solution?

The researchers tested three evasion techniques: adjusting the model's sampling temperature (how random its word choices are), fine-tuning the model with reinforcement learning (trial-and-error training) so it learns to hide its telltale style, and paraphrasing the AI's text so it dodges detection while keeping the meaning the same.
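
To make the first technique concrete, here is a minimal sketch, assuming the Hugging Face transformers library, of how sampling temperature is varied at generation time. The model name "gpt2" and the prompt are placeholders, since the summary does not say which generator the paper used.

```python
# Minimal sketch: varying sampling temperature at generation time.
# "gpt2" is a placeholder model, not necessarily the one used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Breaking news:"
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.7, 1.0, 1.5):
    # Higher temperature flattens the next-token distribution, so the text
    # looks statistically less like "typical" model output, which is the
    # very signal that likelihood-based detectors rely on.
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=60,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"--- temperature={temperature} ---")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

According to the abstract, this simple adjustment was enough to make shallow-learning detectors the least reliable of those tested.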

Why it matters?

The results show that current detection tools can be reliably tricked, so stronger detectors are needed to fight fake news and scams before misinformation spreads through channels like social media or classrooms.

Abstract

The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in an experimental series: systematic changes of the generative models' temperature proved shallow-learning detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based detectors. Finally, rephrasing led to a >90% evasion of zero-shot detectors like DetectGPT, although the texts stayed highly similar to the originals. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.
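
As background for the zero-shot detectors mentioned above, the core DetectGPT criterion can be sketched in a few lines: machine-generated text tends to sit near a local maximum of a scoring model's log-likelihood, so random perturbations lower its score more than they would for human text. This is a simplified illustration of the published DetectGPT idea, not the paper's experimental code; the scoring model ("gpt2") and the source of the perturbed variants are assumptions.

```python
# Simplified sketch of the DetectGPT zero-shot criterion (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood

def detectgpt_score(text: str, perturbations: list[str]) -> float:
    """A large positive gap suggests machine-generated text: the original
    scores noticeably higher than lightly rewritten variants of itself."""
    base = log_likelihood(text)
    mean_perturbed = sum(log_likelihood(p) for p in perturbations) / len(perturbations)
    return base - mean_perturbed
```

In the full method the perturbations come from a mask-filling model such as T5. The paper's rephrasing attack plausibly succeeds against this criterion because a good paraphrase moves the text away from the model's likelihood peak while, as the abstract notes, keeping it highly similar to the original.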