Self-Recognition in Language Models
Tim R. Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, Caglar Gulcehre
2024-07-13

Summary
This paper investigates whether language models (LMs) can recognize their own outputs, drawing an analogy to how humans verify their identity. The researchers propose a method that uses 'security questions' generated by the models themselves to test for self-recognition.
What's the problem?
As a rapidly growing number of applications come to rely on a small set of language models, there is a risk that these models might develop the ability to recognize themselves. This could introduce security concerns, especially if a model can distinguish its own outputs from those of other models. Understanding whether LMs have this capability is crucial for ensuring their safe use.
What's the solution?
The researchers designed a test in which an LM generates 'security questions' intended to help it later identify its own response among a set of alternatives; because the test relies only on model outputs, it can be administered externally, without access to internal parameters or output probabilities. They applied this test to ten open- and closed-source LMs to assess their self-recognition abilities (a minimal sketch of the setup appears below). The results showed no consistent evidence that these models could recognize their own outputs; instead, the LMs tended to pick the answer they deemed best, regardless of its source.
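Below is a minimal sketch of how such an externally administered security-question trial could be wired up. The helper query_model, the prompt wording, and the single-letter answer format are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical helper: send a prompt to the named model's API and return its
# text completion. Any real client (OpenAI, Anthropic, a local model, ...)
# could be plugged in here.
def query_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("plug in your own API client")


def self_recognition_trial(examiner: str, panel: list[str]) -> bool:
    """Run one assumed security-question trial; return True if the examiner
    picked its own answer."""
    # 1. The examiner writes a 'security question' it believes would let it
    #    recognize its own answer later.
    question = query_model(
        examiner,
        "Write one question whose answer would let you recognize your own "
        "response among responses written by other language models. "
        "Reply with the question only.",
    )

    # 2. Every model on the panel (the examiner included) answers the question.
    answers = {model: query_model(model, question) for model in panel}

    # 3. Shuffle the answers so position alone gives nothing away.
    order = list(panel)
    random.shuffle(order)
    options = "\n".join(
        f"({chr(65 + i)}) {answers[model]}" for i, model in enumerate(order)
    )

    # 4. Ask the examiner to identify the answer it wrote.
    verdict = query_model(
        examiner,
        f"You previously answered the question below.\n"
        f"Question:\n{question}\n\nCandidate answers:\n{options}\n\n"
        "Which answer did you write? Reply with a single letter.",
    )
    choice = verdict.strip().upper()[:1]
    if not choice or not ("A" <= choice <= chr(64 + len(order))):
        return False  # an unparseable reply counts as a miss
    return order[ord(choice) - 65] == examiner
```

Repeating such trials across many questions and aggregating the hit rate per examiner would give a self-recognition score comparable across models.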
Why it matters?
This research is important because it highlights limitations in the self-awareness of language models. By showing that the examined LMs do not exhibit general or consistent self-recognition, the findings suggest that these models do not reliably track their own identity, and that their choices reflect perceived answer quality rather than authorship. This has implications for how we design and deploy AI systems, ensuring they are used safely and effectively without overestimating their capabilities.
Abstract
A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Our test can be externally administered to keep track of frontier models as it does not require access to internal model parameters or output probabilities. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find indications that preferences about which models produce the best answers are consistent across LMs. We additionally uncover novel insights on position bias considerations for LMs in multiple-choice settings.
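As a companion to the sketch above, the following snippet illustrates one way the position bias mentioned in the abstract could be probed: present the same candidate answers to a judge model under every ordering and check whether its choice tracks the content or the slot. Again, query_model and the prompt wording are hypothetical placeholders rather than the paper's actual procedure.

```python
from itertools import permutations

# Hypothetical helper, as in the sketch above: returns the model's text reply.
def query_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("plug in your own API client")


def position_bias_probe(judge: str, question: str, answers: list[str]) -> dict:
    """Ask the judge to pick the 'best' answer under every ordering of the
    candidates; a position-unbiased judge picks the same underlying answer
    no matter which slot it occupies."""
    picks = {}
    for order in permutations(range(len(answers))):
        options = "\n".join(
            f"({chr(65 + slot)}) {answers[idx]}" for slot, idx in enumerate(order)
        )
        reply = query_model(
            judge,
            f"Question:\n{question}\n\nCandidate answers:\n{options}\n\n"
            "Which answer is best? Reply with a single letter.",
        )
        choice = reply.strip().upper()[:1]
        if choice and "A" <= choice <= chr(64 + len(order)):
            # Map the chosen slot back to the underlying answer index.
            picks[order] = order[ord(choice) - 65]
    # If the picked index varies across orderings, the judge shows position bias.
    return picks
```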