
PersonaVLM: Long-Term Personalized Multimodal LLMs

Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan

2026-04-20


Summary

This paper introduces a new system called PersonaVLM that aims to make multimodal AI assistants — those that can understand both images and text — much better at adapting to your individual preferences over a long series of interactions.

What's the problem?

Current AI assistants are pretty good at responding to single requests, but they struggle to remember what you like and dislike over a series of conversations. They can’t really build a ‘personality’ for you and consistently tailor their responses to match it, meaning they feel less like a personal assistant and more like a generic chatbot. Existing methods only allow for basic, one-time adjustments to the AI’s behavior.

What's the solution?

The researchers created PersonaVLM, which works in three main ways. First, it actively *remembers* past interactions by summarizing key details into a personal database. Second, it *reasons* by looking back at this database to understand the context of the current conversation and what you’ve liked in the past. Finally, it *aligns* its responses to your evolving personality, making sure the AI sounds and acts more like the assistant *you* want it to be. They also created a new set of tests, called Persona-MME, to specifically measure how well AI systems personalize over time.
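The three capabilities above can be pictured as a simple loop: summarize each interaction into a database, retrieve relevant entries for the current query, and fold an evolving persona into the prompt. The sketch below is a hypothetical illustration, not the paper's actual implementation — the class names, the word-overlap retrieval, and the lowercase "summarizer" are all stand-ins for the MLLM-based components PersonaVLM would use.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    turn: int
    summary: str


@dataclass
class PersonaMemory:
    """Toy sketch of the remember / reason / align loop (assumed design)."""
    entries: list = field(default_factory=list)
    persona: dict = field(default_factory=dict)

    def remember(self, turn: int, interaction: str) -> None:
        # (a) Remembering: condense the interaction into the personal database.
        # Lowercasing stands in for an MLLM summarizer here.
        self.entries.append(MemoryEntry(turn, interaction.strip().lower()))

    def retrieve(self, query: str, k: int = 2) -> list:
        # (b) Reasoning: pull the memories most relevant to the current query.
        # Word-overlap scoring stands in for learned multimodal retrieval.
        q = set(query.lower().split())
        ranked = sorted(self.entries,
                        key=lambda e: -len(q & set(e.summary.split())))
        return [e.summary for e in ranked[:k]]

    def align(self, trait: str, value: str) -> None:
        # (c) Response Alignment: track the user's evolving preferences.
        self.persona[trait] = value

    def build_prompt(self, query: str) -> str:
        # Assemble persona traits and retrieved memories into the final prompt.
        memories = "; ".join(self.retrieve(query))
        traits = ", ".join(f"{k}={v}" for k, v in self.persona.items())
        return f"[persona: {traits}] [memories: {memories}] user: {query}"


# Example: two remembered turns plus one inferred trait shape the next prompt.
mem = PersonaMemory()
mem.remember(1, "User liked the minimalist poster design")
mem.remember(2, "User asked for travel tips in Kyoto")
mem.align("tone", "concise")
prompt = mem.build_prompt("suggest another poster")
print(prompt)
```

In a real system the summarizer, retriever, and persona inference would each be an MLLM call; the point of the sketch is only the data flow: every turn writes to memory, and every response reads from it.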

Why it matters?

This work is important because it moves AI assistants closer to being truly helpful and personalized. By allowing the AI to learn and adapt to your preferences over many interactions, it can provide more relevant, engaging, and satisfying responses. The results show PersonaVLM significantly outperforms existing methods and even surpasses the performance of advanced models like GPT-4o in personalized interactions.

Abstract

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time (see Fig.1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.