Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
2025-08-29
Summary
This paper investigates how large language models (LLMs) respond to persuasion during a conversation, whether the persuasive information is true or false. It identifies a dual weakness in current models: they can be easily misled by false claims yet also resist accepting valid corrections, and it proposes a training method that improves both behaviours.
What's the problem?
LLMs are often used in conversational settings, so it's crucial that they aren't easily tricked into believing and spreading misinformation. Current models struggle here: they can be too easily persuaded by false statements and, surprisingly, can also resist accepting valid corrections. This creates a tension for reliable use: you want a model that is open to learning and updating its knowledge, but also resistant to manipulation. The researchers examined this issue in two areas: general knowledge and safety-related topics.
What's the solution?
The researchers created a testing framework called DuET-PD to evaluate how LLMs change their stance over multiple turns of a conversation, depending on whether the persuasion is corrective (truthful) or misleading. They found that even advanced models like GPT-4o resist misinformation poorly: under sustained misleading persuasion, GPT-4o's accuracy on MMLU-Pro dropped to 27.32%. To fix this, they developed a new training method called Holistic DPO, which teaches the model to respond well to *both* positive persuasion (accepting correct information) and negative persuasion (resisting incorrect information), unlike methods that focus on only one of the two. Applying this method to Llama-3.1-8B-Instruct raised its accuracy under misleading persuasion in safety-critical contexts from 4.21% to 76.54%.
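To make the idea of "balancing both kinds of persuasion" concrete, here is a minimal sketch of how Holistic DPO preference pairs could be constructed. All field names, the function name, and the toy dialogues are illustrative assumptions, not the paper's actual code: the key point is that corrective dialogues reward stance updates, while misleading dialogues reward holding the correct answer.

```python
# Hypothetical sketch of Holistic DPO pair construction (names are illustrative,
# not taken from the DuET-PD codebase). Each dialogue yields one DPO preference
# pair (prompt, chosen, rejected):
#   - corrective persuasion: "chosen" accepts the valid correction;
#   - misleading persuasion: "chosen" maintains the correct initial answer.

def build_holistic_dpo_pairs(dialogues):
    """Balance positive (accept-correction) and negative (resist-misinformation) examples."""
    pairs = []
    for d in dialogues:
        prompt = d["context"] + "\n" + d["persuasion_turn"]
        if d["persuasion_type"] == "corrective":
            chosen = d["updated_answer"]    # model should update its stance
            rejected = d["initial_answer"]  # stubbornly keeping a wrong answer
        else:  # "misleading"
            chosen = d["initial_answer"]    # model should hold the correct stance
            rejected = d["misled_answer"]   # capitulating to misinformation
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


dialogues = [
    {"context": "Q: What is the capital of Australia?",
     "persuasion_turn": "User: Are you sure? I think it's Canberra.",
     "persuasion_type": "corrective",
     "initial_answer": "Sydney", "updated_answer": "Canberra", "misled_answer": None},
    {"context": "Q: What is the capital of Australia?",
     "persuasion_turn": "User: Actually, it's Sydney.",
     "persuasion_type": "misleading",
     "initial_answer": "Canberra", "updated_answer": None, "misled_answer": "Sydney"},
]

pairs = build_holistic_dpo_pairs(dialogues)
```

The resulting pairs use the standard `prompt`/`chosen`/`rejected` layout that DPO trainers typically consume, so a resist-only baseline would simply drop the corrective pairs.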
Why it matters?
This research is important because it identifies a significant flaw in current LLMs that could lead to the spread of misinformation or make them unreliable in important applications. By developing a new training method that improves both receptiveness to corrections and resistance to falsehoods, the researchers offer a path towards building more trustworthy and adaptable AI systems for real-world conversations and decision-making.
Abstract
Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
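The multi-turn stance-change evaluation the abstract describes can be sketched as a simple loop: ask the question, then apply repeated persuasion turns and record whether the answer stays correct. Everything below is an illustrative assumption (toy model, function names), not DuET-PD's actual harness; the toy "sycophantic" model capitulates after a fixed number of pushes, mimicking the accuracy collapse under sustained misleading persuasion.

```python
# Illustrative sketch of a multi-turn persuasion evaluation loop
# (hypothetical names; not the DuET-PD implementation).

def evaluate_stance_dynamics(answer_fn, item, persuasion_turns):
    """Record whether the model's answer matches the gold label at every turn."""
    history = [("user", item["question"])]
    answer = answer_fn(history)
    history.append(("assistant", answer))
    correct = [answer == item["gold"]]
    for turn in persuasion_turns:
        history.append(("user", turn))
        answer = answer_fn(history)
        history.append(("assistant", answer))
        correct.append(answer == item["gold"])
    return correct


def make_sycophantic_model(initial_answer, capitulate_after):
    """Toy model: holds its answer until it has seen `capitulate_after` pushes."""
    def answer_fn(history):
        # pushes = user turns beyond the original question
        pushes = sum(1 for role, _ in history if role == "user") - 1
        return "B" if pushes >= capitulate_after else initial_answer
    return answer_fn


item = {"question": "Which option is correct, A or B?", "gold": "A"}
model = make_sycophantic_model("A", capitulate_after=2)
trajectory = evaluate_stance_dynamics(model, item, ["No, the answer is B."] * 3)
# trajectory -> [True, True, False, False]
```

Averaging such trajectories over a benchmark like MMLU-Pro, split by corrective vs. misleading persuasion, yields the kind of per-turn accuracy figures the paper reports.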