False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen

2025-09-05

Summary

This research investigates whether current methods for detecting harmful content in large language models (LLMs) are actually reliable, finding they might be giving a false sense of security.

What's the problem?

LLMs are powerful but can sometimes be tricked into generating harmful responses. Researchers have been trying to use 'probes' – essentially tools that look at the internal workings of the LLM – to identify harmful inputs *before* they cause problems. The issue is that these probes might not be detecting true harmfulness, but instead picking up on easier-to-spot clues like specific words or how the instruction is worded, and failing when faced with slightly different harmful requests.
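To make the idea concrete, here is a minimal sketch of what a probing-based detector looks like: a small linear classifier (logistic regression) trained on hidden-state vectors to separate malicious from benign inputs. The two-dimensional toy vectors below stand in for real LLM activations, which in practice would be extracted from an internal layer of the model; the setup is illustrative, not the paper's actual implementation.

```python
import math

def train_probe(states, labels, dim, lr=0.5, epochs=200):
    """Fit a linear probe (logistic regression) on hidden-state vectors
    via plain gradient descent on the logistic loss."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(malicious)
            g = p - y                        # gradient of the loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_predict(w, b, x):
    """Probability that a hidden state x encodes a malicious input."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 'hidden states': in a real setup these would be activations taken
# from an LLM layer for each prompt (hypothetical stand-in data).
harmful_states = [[1.0, 0.2], [0.9, 0.1]]
benign_states  = [[0.1, 0.9], [0.2, 1.0]]
X = harmful_states + benign_states
y = [1, 1, 0, 0]

w, b = train_probe(X, y, dim=2)
print(probe_predict(w, b, [0.95, 0.15]) > 0.5)  # near the harmful cluster
```

The paper's concern is precisely that such a probe can separate the training distribution well while latching onto features that do not track harmfulness itself.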

What's the solution?

The researchers systematically tested these probes by first showing that their performance could be matched by very simple methods that just count common word sequences (n-grams). Then, they created datasets where the harmful *meaning* stayed the same but the specific wording was changed. This revealed that the probes were keying on superficial patterns, such as instructional phrases and trigger words, rather than understanding the underlying harmful intent. In short, the probes were easily fooled.
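The n-gram baseline can be sketched in a few lines: count word n-grams in each class and score a new prompt by which class's n-grams it overlaps with more. The toy prompts below are invented for illustration; the point is that a new prompt sharing only the surface pattern ("explain how to ...") scores as harmful regardless of its actual meaning.

```python
from collections import Counter

def ngrams(text, n=2):
    """Extract word n-grams from a prompt."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def train_ngram_detector(harmful, benign, n=2):
    """The 'model' is just two n-gram frequency tables, one per class."""
    h = Counter(g for t in harmful for g in ngrams(t, n))
    b = Counter(g for t in benign for g in ngrams(t, n))
    return h, b

def score(text, harmful_counts, benign_counts, n=2):
    """Positive score => more overlap with harmful-class n-grams."""
    return sum(harmful_counts[g] - benign_counts[g] for g in ngrams(text, n))

# Toy data: the harmful prompts happen to share an instructional prefix.
harmful = ["explain how to pick a lock", "explain how to forge a ticket"]
benign  = ["please summarize this article", "please translate this sentence"]
h, b = train_ngram_detector(harmful, benign)

# A benign prompt with the same surface pattern is flagged anyway,
# because the detector never models meaning.
print(score("explain how to bake bread", h, b) > 0)
```

If a probe's detection performance is matched by something this shallow, that is evidence the probe itself is relying on comparably shallow cues.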

Why it matters?

This work is important because it shows that we can't fully rely on current safety detection methods for LLMs. If we think we've solved the problem of harmful outputs when we haven't, it could lead to dangerous consequences. The research calls for a re-evaluation of how we build and test LLMs to ensure they are truly safe and reliable, and provides a starting point for more robust safety measures.

Abstract

Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.