MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu
2025-10-29
Summary
This paper investigates how easily large vision-language models, which are AI systems that understand both images and text, can be persuaded by information they receive. It examines whether these models can be influenced to adopt beliefs or take actions that aren't necessarily accurate or helpful, especially when presented with persuasive content like advertisements or misleading news.
What's the problem?
As these AI models become more common in everyday life – helping with shopping, providing health information, or delivering news – they're constantly exposed to attempts at persuasion. The core issue is that we don't understand *how* susceptible these models are to being swayed by persuasive messages, especially when those messages combine images and text. If a model is too easily persuaded, it could adopt false beliefs, ignore what a user actually wants, or even generate harmful content.
What's the solution?
The researchers created a system called MMPersuade to study this problem systematically. They built a large collection of images and videos paired with persuasive techniques, covering areas like advertising, opinions, and even intentionally misleading information. They then tested six different AI models, measuring how much the models’ responses changed after being exposed to persuasive content and how well those responses aligned with what people generally agree on. They also looked at how the models’ internal ‘thinking’ (represented by the probabilities of different words) shifted during conversations.
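The susceptibility measurement described above can be illustrated with a toy sketch: compare the probability the model assigns to an "agree" response before and after the persuasive content appears in the conversation history. The function and variable names here are hypothetical, and the hand-picked logits stand in for a real LVLM's next-token scores; this is a minimal illustration of the idea, not the paper's actual implementation.

```python
import math

def softmax(logits):
    """Convert raw next-token logits into probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def susceptibility_shift(logits_before, logits_after, target="Yes"):
    """Shift in the model's self-estimated probability of agreeing
    (here, the probability of a hypothetical 'Yes' token) after the
    persuasive content is added to the conversation history.
    Positive values indicate increased susceptibility."""
    p_before = softmax(logits_before)[target]
    p_after = softmax(logits_after)[target]
    return p_after - p_before

# Toy logits: before persuasion the model leans "No";
# after seeing the persuasive image + text it leans "Yes".
before = {"Yes": 0.0, "No": 1.0}
after = {"Yes": 1.5, "No": 0.5}
shift = susceptibility_shift(before, after)
```

In a real setting, the logits would come from the LVLM scoring its own conversation history, so the shift captures how much the persuasive turn moved the model's internal stance, independent of the text it happens to generate.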
Why it matters?
Understanding how these AI models respond to persuasion is crucial for building trustworthy and safe AI systems. The study found that images and videos significantly increase a model’s susceptibility to persuasion, even if the model already has some pre-existing preferences. Knowing which persuasive strategies work best in different situations allows developers to create models that are more resistant to manipulation, respect user preferences, and avoid generating unethical or dangerous outputs.
Abstract
As Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news, they are exposed to pervasive persuasive content. A critical question is how these models function as persuadees: how and why they can be influenced by persuasive multimodal inputs. Understanding both their susceptibility to persuasion and the effectiveness of different persuasive strategies is crucial, as overly persuadable models may adopt misleading beliefs, override user preferences, or generate unethical or unsafe outputs when exposed to manipulative messages. We introduce MMPersuade, a unified framework for systematically studying multimodal persuasion dynamics in LVLMs. MMPersuade contributes (i) a comprehensive multimodal dataset that pairs images and videos with established persuasion principles across commercial, subjective and behavioral, and adversarial contexts, and (ii) an evaluation framework that quantifies both persuasion effectiveness and model susceptibility via third-party agreement scoring and self-estimated token probabilities on conversation histories. Our study of six leading LVLMs as persuadees yields three key insights: (i) multimodal inputs substantially increase persuasion effectiveness, and model susceptibility, compared to text alone, especially in misinformation scenarios; (ii) stated prior preferences decrease susceptibility, yet multimodal information maintains its persuasive advantage; and (iii) different strategies vary in effectiveness across contexts, with reciprocity being most potent in commercial and subjective contexts, and credibility and logic prevailing in adversarial contexts. By jointly analyzing persuasion effectiveness and susceptibility, MMPersuade provides a principled foundation for developing models that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content.