REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation
Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, Francesco Barbieri
2025-02-20
Summary
This paper introduces REALTALK, a new dataset of real conversations between people collected over 21 days. It is a record of how actual humans chat with each other on messaging apps, which can be used to help make AI chatbots better at having long, natural conversations.
What's the problem?
Current chatbots are mostly trained on synthetic conversations generated by other AI systems. This means they don't really learn how real people talk to each other over time, especially when it comes to expressing emotions and remembering past chats. It's like trying to teach someone how to be a good friend by only showing them scripted TV shows instead of real friendships.
What's the solution?
The researchers created REALTALK by having pairs of real people chat with each other for 21 days. They then studied these conversations to see how people express emotions and maintain a consistent persona over time. They also built two benchmark tasks for AI: one where a model tries to continue a conversation as if it were one of the participants, and another where it has to answer questions about things mentioned earlier in the chat. These tasks measure how well AI can mimic a real person and remember important details from long conversations.
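To make the two benchmark tasks concrete, here is a minimal sketch of how prompts for them might be assembled from dialogue history. The `Turn` class, function names, and speaker labels are illustrative assumptions, not the paper's actual data schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # e.g. "user_a" or "user_b" (hypothetical labels)
    text: str

def build_persona_prompt(history: list[Turn], target_speaker: str) -> str:
    """Persona simulation: ask a model to continue the chat on behalf of
    one user, given only the prior dialogue context."""
    lines = [f"{t.speaker}: {t.text}" for t in history]
    lines.append(f"Continue the conversation as {target_speaker}.")
    return "\n".join(lines)

def build_memory_probe(history: list[Turn], question: str) -> str:
    """Memory probing: ask a targeted question whose answer requires
    recalling something mentioned earlier in the conversation."""
    lines = [f"{t.speaker}: {t.text}" for t in history]
    lines.append(f"Question about the conversation above: {question}")
    return "\n".join(lines)

# Toy example conversation
history = [
    Turn("user_a", "My sister's wedding is on March 3rd!"),
    Turn("user_b", "Congrats! Are you giving a toast?"),
]
print(build_persona_prompt(history, "user_a"))
print(build_memory_probe(history, "When is the wedding?"))
```

In the paper's setup, the first prompt style would be scored by how closely the model's continuation matches the real user's actual messages, and the second by whether the model's answer correctly recovers the detail (here, the wedding date) from earlier context.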
Why it matters?
This matters because it helps make AI chatbots more human-like and better at long conversations. By understanding real chat patterns, emotions, and how people remember things, we can create AI that's better at talking to us naturally over time. This could lead to more helpful virtual assistants, better online customer service, and even AI companions that can have meaningful, long-term relationships with people.
Abstract
Long-term, open-domain dialogue capabilities are essential for chatbots aiming to recall past interactions and demonstrate emotional intelligence (EI). Yet, most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues, providing a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing where a model answers targeted questions requiring long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.