PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang
2025-09-16
Summary
This paper introduces PersonaX, a new collection of multimodal datasets designed to help researchers study how behavioral traits relate to other attributes, such as a person's appearance and biographical background.
What's the problem?
Currently, it's hard to study personality because few datasets pair descriptions of someone's behavior with complementary information such as facial images or biographical details like occupation. Researchers often need multiple types of data to get a complete picture, but finding them combined in a single resource is a challenge.
What's the solution?
The creators of this paper built PersonaX, which includes two datasets: CelebPersona (covering 9,444 public figures) and AthlePersona (covering 4,181 professional athletes). They used three high-performing large language models to infer behavioral traits from text, then combined those assessments with facial images and structured biographical details. They also developed a new causal representation learning method for analyzing this kind of combined data, helping reveal how different factors might influence each other.
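To make the idea of relating a trait assessment to another modality concrete, here is a minimal sketch of a permutation-based independence test between a trait score and a numeric biographical feature. This is an illustration only, not necessarily one of the five tests the paper uses; the variable names ("conscientiousness"-style trait score, career length) and the toy data are assumptions:

```python
import numpy as np

def permutation_independence_test(x, y, n_perm=2000, seed=0):
    """Permutation test of independence between two 1-D samples.

    Test statistic: |Pearson correlation|. The null distribution is
    built by shuffling y, breaking any pairing with x. Returns a p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        if abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed:
            exceed += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Toy example: a hypothetical trait score vs. career length (in years).
rng = np.random.default_rng(42)
career_years = rng.uniform(1, 20, size=300)
trait_score = 0.3 * career_years + rng.normal(0, 2, size=300)  # dependent
noise_score = rng.normal(0, 1, size=300)                       # independent

print(permutation_independence_test(trait_score, career_years))  # small p: reject independence
print(permutation_independence_test(noise_score, career_years))
```

A small p-value for the dependent pair indicates the trait score and the biographical feature are unlikely to be independent, which is the kind of cross-modal relationship the paper's structured analysis probes at scale.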
Why it matters?
This work is important because it provides a valuable resource for anyone studying human behavior, especially in fields like human-computer interaction and artificial intelligence. By linking LLM-inferred behavioral traits to visual and biographical information, it enables a more nuanced and accurate analysis of people, which can lead to better personalized AI systems and a deeper understanding of social dynamics.
Abstract
Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.