Brain-Grounded Axes for Reading and Steering LLM States

Sandro Andric

2025-12-23

Summary

This research explores a new way to understand and control what's happening inside large language models (LLMs), like the ones powering chatbots. Instead of relying on text-based clues to figure out how these models work, the authors use data from human brain activity.

What's the problem?

Currently, methods for understanding LLMs often depend on analyzing the text they process. This can be limiting because the text itself doesn't always fully explain *why* the model is making certain decisions or how it represents concepts. It's like trying to understand a person's thoughts just by reading their emails – you're missing a lot of the underlying brain processes.

What's the solution?

The researchers recorded brain activity using a technique called MEG (magnetoencephalography) while people processed different words. From these recordings they built a 'brain atlas' that maps words to specific patterns of brain activity, and used independent component analysis to extract a set of latent axes from it. Next, they trained small add-on modules (called adapters) that map the LLM's internal hidden states onto these brain axes, without changing the LLM itself. This lets them both read the model's states in brain-derived coordinates and steer its behavior by pushing activations along specific brain-derived directions, and they found consistent patterns across different LLMs.
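To make the adapter idea concrete, here is a minimal sketch, not the paper's implementation: a ridge-regularized linear map from (synthetic stand-in) LLM hidden states to brain-axis scores, which can then be inverted into a steering direction. All names, dimensions, and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_model = LLM hidden width, k = number of brain axes.
d_model, k, n_words = 64, 8, 500

# Stand-in data: in the paper, H would come from a frozen LLM layer and
# Y from the MEG-derived ICA axes of the word-level brain atlas.
H = rng.standard_normal((n_words, d_model))
W_true = rng.standard_normal((d_model, k))
Y = H @ W_true + 0.1 * rng.standard_normal((n_words, k))

# "Adapter": a closed-form ridge regression from hidden states to brain axes,
# trained without touching the LLM's own weights.
lam = 1e-2
W = np.linalg.solve(H.T @ H + lam * np.eye(d_model), H.T @ Y)

def read_axes(h):
    """Read a hidden state's coordinates on the brain axes."""
    return h @ W

def steer(h, axis, alpha):
    """Push a hidden state along one brain-derived direction."""
    direction = W[:, axis] / np.linalg.norm(W[:, axis])
    return h + alpha * direction

h = H[0]
h_steered = steer(h, axis=3, alpha=2.0)
# Steering raises the state's reading on the targeted axis.
delta = read_axes(h_steered)[3] - read_axes(h)[3]
```

A real setup would fit one such adapter per layer and validate the steered model's text output, but the read/steer mechanics are essentially this linear picture.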

Why it matters?

This work is important because it offers a more direct and potentially more accurate way to interpret and control LLMs. By grounding the model's behavior in actual human brain activity, it could lead to more trustworthy and understandable AI systems. It provides a new 'handle' for controlling LLMs, moving beyond just manipulating text and towards a deeper understanding of how these models represent and process information.

Abstract

Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.
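The atlas is built from phase-locking value (PLV) patterns. The PLV itself has a standard definition: the magnitude of the time-averaged phase-difference phasor between two signals. A minimal sketch on synthetic phase series (not SMN4Lang data):

```python
import numpy as np

def plv(phase_a, phase_b):
    """Phase-locking value: |mean of exp(i * phase difference)| over time."""
    return np.abs(np.mean(np.exp(1j * (phase_a - phase_b))))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)
phase = 2 * np.pi * 10 * t                      # phase of a 10 Hz oscillation
locked = phase + 0.3                            # constant lag: strong locking
unlocked = rng.uniform(0.0, 2 * np.pi, t.size)  # unrelated phases: weak locking

plv_locked = plv(phase, locked)      # near 1.0
plv_unlocked = plv(phase, unlocked)  # near 0.0
```

In practice the phases would come from band-filtered MEG sensor signals (e.g. via a Hilbert transform), and one PLV is computed per sensor pair per word to form the atlas features.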