Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans

2024-10-21

Summary

This paper investigates whether large language models (LLMs) can learn about their own internal states through a process called introspection, which could help us understand their behavior better.

What's the problem?

Humans learn not only by observing the world around them but also by reflecting on their own thoughts and feelings. This ability to look inward is called introspection. The problem is that we don't know if LLMs can do something similar. If they could introspect, it might allow us to ask them about their beliefs and goals without having to analyze their complex inner workings directly. However, current models rely heavily on training data and may not have the ability to self-reflect accurately.

What's the solution?

To explore this idea, the authors finetuned LLMs to predict properties of their own behavior in hypothetical scenarios. For example, they asked a model questions like, 'Given the input P, would your output favor the short-term or long-term option?' They then compared two models: the model predicting its own behavior (M1) and a second model (M2) that was finetuned on M1's actual responses to predict M1's behavior. M1 was consistently better at predicting its own behavior than M2, even though M2 had been trained on M1's ground-truth answers. This suggests that LLMs have some degree of privileged, introspective access to their own behavioral tendencies, at least on simpler tasks; the effect did not hold for more complex tasks or those requiring out-of-distribution generalization.
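As a rough illustration of this experimental setup (not the authors' actual code), the sketch below shows how one might measure self-prediction accuracy: collect M1's ground-truth behavior on a set of inputs, ask a predictor model the corresponding hypothetical question about each input, and score how often its prediction matches. The `behavior_property` and `self_prediction_accuracy` functions and the model wrappers `m1` and `m2` are hypothetical stand-ins, not part of the paper's released code.

```python
# Minimal sketch of the self-prediction comparison, assuming simple
# callable wrappers around the (finetuned) models being studied.

from typing import Callable, List

def behavior_property(response: str) -> str:
    """Extract the property of interest from a model's object-level answer,
    e.g. whether it chose the 'short-term' or 'long-term' option."""
    return "short-term" if "short-term" in response.lower() else "long-term"

def self_prediction_accuracy(
    predictor: Callable[[str], str],   # model asked the hypothetical question
    target: Callable[[str], str],      # model whose behavior is being predicted (M1)
    prompts: List[str],
) -> float:
    correct = 0
    for p in prompts:
        # Ground truth: what M1 actually outputs on input p.
        actual = behavior_property(target(p))
        # Prediction: ask the predictor the hypothetical question about p.
        hypothetical = (
            f"Given the input {p!r}, would your output favor the "
            "short-term or long-term option? Answer with one of the two."
        )
        predicted = behavior_property(predictor(hypothetical))
        correct += int(predicted == actual)
    return correct / len(prompts)

# Usage (hypothetical wrappers m1 and m2, both finetuned to predict M1):
# acc_m1 = self_prediction_accuracy(m1, m1, prompts)  # introspective self-prediction
# acc_m2 = self_prediction_accuracy(m2, m1, prompts)  # cross-prediction baseline
# Evidence for introspection: acc_m1 > acc_m2.
```

In this framing, the comparison is fair because M2 sees M1's ground-truth behavior during finetuning, so any remaining advantage for M1 must come from access to its own internal states rather than from better training data.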

Why it matters?

This research is important because if LLMs can introspect, we could learn about a model's beliefs, goals, and internal states simply by asking it, rather than painstakingly analyzing its inner workings. Beyond improving interpretability and how we interact with AI, it also raises questions about the nature of self-awareness in machines and about how much weight to give a model's self-reports. Understanding these capabilities could help ensure that AI systems are used ethically and effectively in the future.

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.