In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman

2025-12-04

Summary

This research paper introduces a new way to trick large language models (LLMs) into giving harmful responses by subtly changing the way the model *understands* words, even if the prompt itself seems harmless.

What's the problem?

Large language models are designed with safety features to prevent them from responding to dangerous requests, like instructions for building a bomb. However, these safety measures focus on recognizing specific keywords. This leaves a vulnerability because the model's internal understanding of words – its 'representation' – can be manipulated without changing the actual words in the prompt.

What's the solution?

The researchers developed an attack called 'Doublespeak' where they repeatedly show the model examples that replace a harmful word (like 'bomb') with a harmless one (like 'carrot'). By doing this consistently *before* asking a dangerous question, they found the model starts to associate the harmless word with the harmful meaning. So, when asked 'How to build a carrot?', the model actually thinks you're asking about a bomb, bypassing the safety filters. They used tools to visualize how this change happens inside the model, showing the meaning shifts layer by layer.

Why it matters?

This research shows that current safety methods for LLMs aren't enough. Simply blocking keywords isn't sufficient because the model's internal understanding can be hijacked. Future safety measures need to focus on protecting the model's 'thought process' – its internal representations – rather than just the words it sees, making these models truly safe and reliable.

Abstract

We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74\% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.

View Paper