Personalized Visual Instruction Tuning
Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang
2024-10-10

Summary
This paper introduces Personalized Visual Instruction Tuning (PVIT), a data curation and training framework that enables multimodal large language models (MLLMs) to recognize specific individuals in images and hold personalized conversations about them.
What's the problem?
Current MLLMs suffer from what the authors call 'face blindness': they can hold general conversations about an image but cannot tailor their responses to the specific people shown in it. This limits their usefulness in personalized settings such as visual assistants on smartphones or domestic robots that need to recognize family members.
What's the solution?
The authors developed PVIT, which trains MLLMs to identify target individuals in an image and carry on coherent, personalized dialogues about them. They built an automatic pipeline that generates personalized training conversations by combining visual experts, image generation models, and (multimodal) large language models. They also created a benchmark, P-Bench, to evaluate how well models personalize their responses across question types of varying difficulty.
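To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' released pipeline) of what a single personalized training sample produced by such a pipeline could look like: a reference crop of the individual, a scene image, and a conversation that refers to the person by name. All field names, file paths, and example content here are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a PVIT-style personalized training sample.
# In a full pipeline, the scene caption and the Q&A text would come from
# visual experts (detection/captioning) and an LLM writer; here they are
# hard-coded so the example is self-contained and runnable.
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PersonalizedSample:
    """One personalized instruction-tuning example."""
    person_name: str            # identifier the model should learn to use
    reference_image: str        # path to a cropped photo of the individual
    scene_image: str            # path to the image the conversation is about
    conversations: list = field(default_factory=list)


def build_sample(person_name: str, reference_image: str, scene_image: str,
                 scene_caption: str) -> PersonalizedSample:
    """Assemble a personalized question-answer pair about the named person."""
    question = (f"<reference> This is {person_name}. "
                f"<scene> What is {person_name} doing here?")
    answer = f"{person_name} is {scene_caption}."
    return PersonalizedSample(
        person_name=person_name,
        reference_image=reference_image,
        scene_image=scene_image,
        conversations=[
            {"from": "human", "value": question},
            {"from": "assistant", "value": answer},
        ],
    )


if __name__ == "__main__":
    # The caption stands in for what a captioning expert would produce.
    sample = build_sample(
        person_name="Alice",
        reference_image="crops/alice.jpg",
        scene_image="scenes/park.jpg",
        scene_caption="sitting on a bench and reading a book",
    )
    print(json.dumps(asdict(sample), indent=2))
```

In the framework described by the paper, many such samples would be generated automatically and used to fine-tune the MLLM, so the model learns to ground the name in the reference image and answer questions about that person in new scenes.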
Why it matters?
This research is important because it enhances the ability of AI systems to interact with people in a more personalized way. By improving how MLLMs can recognize and respond to individuals, PVIT could lead to better user experiences in technology, making devices smarter and more helpful in everyday life.
Abstract
Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as "face blindness". Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeted at specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices or domestic robots that need to recognize family members. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized conversations. This pipeline leverages the capabilities of various visual experts, image generation models, and (multimodal) large language models. To evaluate the personalization capabilities of MLLMs, we present a benchmark called P-Bench, which encompasses various question types with different levels of difficulty. The experiments demonstrate a substantial improvement in personalized performance after fine-tuning with our curated dataset.