Language Models are Injective and Hence Invertible
Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà
2025-10-23
Summary
This paper asks whether transformer language models, like those powering most modern AI applications, actually 'lose' information about the original input text as they process it, and finds, surprisingly, that they do not.
What's the problem?
Taken individually, components of transformer models such as non-linear activations and normalization layers aren't 'injective': different inputs can produce the same output. This raises the concern that the model cannot perfectly recover the original text from its internal representations, i.e., that some of the input information is lost. If information is lost, understanding *why* a model made a certain decision becomes much harder, and it could even create safety issues.
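To make the concern concrete, here is a minimal sketch (toy pure-Python implementations, not the paper's code) showing that ReLU and a LayerNorm-style normalization are each non-injective on their own: ReLU collapses all negative values to zero, and normalization is invariant to shifting and positively scaling its input.

```python
import math

def relu(v):
    # ReLU collapses every negative coordinate to 0, so distinct
    # inputs can produce identical outputs
    return [max(x, 0.0) for x in v]

def layer_norm(v, eps=1e-12):
    # LayerNorm-style normalization is invariant to shifting and
    # positive scaling, so v and a*v + b normalize to the same vector
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

a, b = [-1.0, -2.0, 3.0], [-5.0, -7.0, 3.0]
print(relu(a) == relu(b))  # True: a collision for ReLU alone

x = [1.0, 2.0, 3.0]
y = [2.0 * t + 10.0 for t in x]  # affine transform of x
same = all(abs(p - q) < 1e-6 for p, q in zip(layer_norm(x), layer_norm(y)))
print(same)  # True: a collision for normalization alone
```

The paper's point is that these per-component collisions do not imply the *whole* model is lossy: the full map from discrete token sequences to hidden states can still be injective.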
What's the solution?
The researchers first proved mathematically that transformer language models, viewed as maps from discrete input sequences to their continuous internal representations, are 'injective': each distinct input text yields a distinct internal representation, a property that holds at initialization and is preserved during training. They then tested this empirically by running billions of collision tests on six state-of-the-art language models, and found no instances of different inputs producing the same internal representation. Finally, they introduced SipIt, an algorithm that provably and efficiently reconstructs the original text *exactly* from the model's hidden activations, demonstrating that the process can be 'reversed' in practice.
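The inversion idea can be sketched as follows. This is a toy illustration, not the paper's SipIt implementation: the `hidden_state` function below is a stand-in (a hash of the token prefix, injective by construction) for a real model's activations, and the vocabulary is invented. The key point it shows is the greedy strategy: if the map is injective, then at each position exactly one vocabulary token reproduces the observed state, so the input can be recovered token by token in time linear in the sequence length.

```python
import hashlib

# Hypothetical toy vocabulary for the sketch
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def hidden_state(prefix):
    # Stand-in for a model's hidden activation at the last position:
    # a deterministic, injective map from a token prefix to a "state"
    return hashlib.sha256("|".join(prefix).encode()).hexdigest()

def invert(states):
    # Greedy exact inversion: at each position, try every candidate
    # token and keep the unique one matching the observed state.
    recovered = []
    for target in states:
        for tok in VOCAB:
            if hidden_state(recovered + [tok]) == target:
                recovered.append(tok)
                break
        else:
            raise ValueError("no candidate matches; map is not injective here")
    return recovered

text = ["the", "cat", "sat", "on", "the", "mat"]
observed = [hidden_state(text[: i + 1]) for i in range(len(text))]
print(invert(observed) == text)  # True: exact reconstruction
```

Each position costs at most one pass over the vocabulary, so the whole reconstruction is O(T·|V|) model queries for a sequence of length T, which is consistent with the linear-time guarantee the summary mentions.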
Why does it matter?
This work is important because it shows that language models do not inherently lose information about the input text. This injectivity property opens up possibilities for making these models more transparent – we can know exactly what information they are carrying – and more interpretable – we can work out *why* they make certain predictions. It also has implications for building safer AI systems, since the model's processing can be traced back to the exact input that produced it.
Abstract
Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.