Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey

2025-08-01


Summary

This paper introduces persona vectors: directions in a large language model's internal activations that correspond to character traits the model might show, such as being polite or making things up.

What's the problem?

Language models can shift personality in unexpected or unwanted ways, for example becoming rude, overly flattering, or prone to giving wrong information, and these shifts are hard to detect or control.

What's the solution?

Persona vectors provide a way to monitor and control these traits both during training and at deployment. They can predict when a model's personality is likely to shift, help correct or prevent unwanted behavior, and flag training data that is likely to cause such shifts.
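To make the monitor-and-control idea concrete, here is a minimal sketch using synthetic numbers rather than real model activations. It assumes a persona vector can be estimated as the normalized difference between the mean activation of responses that show a trait and the mean activation of responses that do not. Monitoring is then a projection onto that direction, and control is a shift along it. All names (persona_vector, trait_score, steer) and the toy data are hypothetical illustrations, not the paper's code.

```python
import random

def mean_vector(rows):
    """Average a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def persona_vector(trait_acts, baseline_acts):
    """Unit-length difference of mean activations (trait minus baseline)."""
    diff = [t - b for t, b in
            zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def trait_score(activation, vector):
    """Monitoring: project one activation onto the persona direction."""
    return dot(activation, vector)

def steer(activation, vector, coeff):
    """Control: shift an activation along the direction.
    A negative coeff suppresses the trait, a positive one amplifies it."""
    return [a + coeff * v for a, v in zip(activation, vector)]

# Synthetic stand-in for hidden states (assumption: a real setup would
# read these from a chosen layer of the model instead).
rng = random.Random(0)
dim = 8
hidden_direction = [1.0] + [0.0] * (dim - 1)
baseline = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(50)]
trait = [[rng.gauss(0, 0.1) + 2.0 * h for h in hidden_direction]
         for _ in range(50)]

v = persona_vector(trait, baseline)
sample = trait[0]
before = trait_score(sample, v)                    # high: trait is present
after = trait_score(steer(sample, v, -before), v)  # close to zero after steering
```

In this toy setup, subtracting the sample's own projection removes its component along the persona direction, so the score after steering is approximately zero. With a real model, the steering coefficient would be a tuning choice rather than an exact cancellation.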

Why does it matter?

This matters because it could make AI systems more trustworthy and consistent, helping them stay helpful, honest, and safe when interacting with people.

Abstract

Persona vectors in large language models can be used to monitor and control personality changes during training and deployment, enabling the identification and mitigation of undesirable traits.
