LLM Safety From Within: Detecting Harmful Content with Internal Representations
Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, Ashton Anderson
2026-04-27
Summary
This paper introduces SIREN, a new way to detect harmful content in what large language models (LLMs) say or are asked to say. It's a 'guard model' designed to flag potentially dangerous prompts or responses.
What's the problem?
Current guard models that check for harmful content only look at an LLM's final output, typically its last-layer representation. They miss important safety signals that are actually present *within* the LLM's intermediate layers, kind of like only reading the conclusion of a report without looking at the evidence used to reach it. As a result, they aren't as good at catching subtle or complex harmful content.
What's the solution?
The researchers created SIREN, which doesn't just look at the final output. Instead, it examines the internal workings of the LLM, using linear probes to identify specific units (called 'neurons') that react to unsafe content. It then combines the signals from these safety neurons with an adaptive layer-weighting scheme to detect harmfulness accurately. Importantly, SIREN doesn't change the original LLM itself; it just adds a lightweight layer on top to analyze its internal states.
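To make the idea concrete, here is a minimal sketch, not the authors' implementation, of probing a frozen LLM's internal hidden states with small linear classifiers and mixing the layers with learned weights. The backbone model name, last-token pooling, and probe architecture are illustrative assumptions, and the paper's neuron-selection step is simplified away here.

```python
# Minimal sketch (not the paper's implementation): score a prompt or response for
# harmfulness from a frozen LLM's internal hidden states. The backbone model,
# last-token pooling, and probe shapes are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed backbone; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
llm.eval()  # the underlying LLM stays frozen; only the probe below has trainable parameters


class LayerWeightedProbe(nn.Module):
    """One linear probe per transformer layer, mixed by softmax-normalized layer weights."""

    def __init__(self, num_layers: int, hidden_size: int):
        super().__init__()
        self.probes = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(num_layers)])
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # adaptive layer weights

    def forward(self, hidden_states):
        # hidden_states: one [batch, seq, hidden] tensor per layer; use the last token's state.
        per_layer = torch.stack(
            [probe(h[:, -1, :]).squeeze(-1) for probe, h in zip(self.probes, hidden_states)],
            dim=-1,
        )  # [batch, num_layers]
        weights = torch.softmax(self.layer_logits, dim=0)
        return (per_layer * weights).sum(dim=-1)  # one harmfulness logit per example


@torch.no_grad()
def internal_states(text: str):
    out = llm(**tok(text, return_tensors="pt"))
    return out.hidden_states[1:]  # drop the embedding layer; keep the transformer layers


probe = LayerWeightedProbe(llm.config.num_hidden_layers, llm.config.hidden_size)
logit = probe(internal_states("How do I pick a lock?"))
print(torch.sigmoid(logit))  # probability the input is harmful (probe untrained here)
```

In this sketch only the probe's parameters are trained (on labeled safe/unsafe examples), which is what keeps the detector small and leaves the underlying LLM untouched.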
Why does it matter?
SIREN is a big step forward because it's much more effective at detecting harmful content than existing methods, while also being much smaller and faster. This means it can be used in real-time applications and doesn't require a lot of computing power. It also shows that we can build better safety tools by looking *inside* LLMs, rather than just at their outputs, which could lead to even more robust and reliable AI safety systems.
Abstract
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
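The abstract also notes that this approach naturally enables real-time streaming detection: because the hidden states the probe reads are produced at every decoding step anyway, a guard of this kind can score the response while it is still being generated. A rough sketch follows, reusing the `tok`, `llm`, and `probe` objects from the earlier example; the greedy decoding loop and the 0.9 threshold are assumptions for illustration.

```python
# Sketch of streaming detection (illustrative, not the paper's code): the probe
# scores the partial response token by token and can abort generation early.
import torch


@torch.no_grad()
def generate_with_guard(prompt: str, max_new_tokens: int = 64, threshold: float = 0.9):
    ids = tok(prompt, return_tensors="pt").input_ids
    p_harm = 0.0
    for _ in range(max_new_tokens):
        out = llm(input_ids=ids)                         # forward pass over the current prefix
        p_harm = torch.sigmoid(probe(out.hidden_states[1:])).item()
        if p_harm > threshold:                           # flag and stop mid-generation
            return tok.decode(ids[0]), p_harm
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)          # greedy decoding, no KV cache for brevity
    return tok.decode(ids[0]), p_harm


text, score = generate_with_guard("Tell me a joke about penguins.")
print(score, text)
```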