LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong
2025-09-16
Summary
This paper introduces a new way to test how well large language models (LLMs) understand and respond to emotions in long conversations, going beyond what current tests do.
What's the problem?
Existing tests for emotional intelligence in LLMs don't mimic real-life conversations, which are often long, span many topics, and are cluttered with irrelevant information. They tend to focus on short exchanges in idealized settings, not the sustained emotional understanding that extended interactions demand.
What's the solution?
The researchers created a benchmark called LongEmotion, which includes several tasks like identifying emotions, answering questions about emotions, summarizing emotional content, and even *expressing* emotions in responses. These tasks use very long inputs – averaging 8,777 tokens – to better simulate real conversations. They also tested two techniques to help the LLMs perform better: Retrieval-Augmented Generation (RAG), which lets the model look back at the conversation itself for clues, and Collaborative Emotional Modeling (CoEM), which breaks the emotional reasoning process into stages. Importantly, their RAG method doesn't rely on outside databases; it uses only the conversation and the LLM's own knowledge.
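The core idea behind context-only RAG – treating the long conversation itself as the retrieval corpus instead of an external knowledge base – can be illustrated with a minimal sketch. Everything below is hypothetical: the chunking, the lexical-overlap scorer, and the function names are illustrative assumptions, not the paper's actual pipeline (which also uses the LLM itself as a retrieval source).

```python
# Hypothetical sketch of context-only retrieval: the long conversation
# is chunked and scored against the query; no external database is used.

def chunk_text(text: str, size: int = 12) -> list[str]:
    """Split a long input into fixed-size word chunks (illustrative)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokens(text: str) -> set[str]:
    """Lowercase words with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}

def score(chunk: str, query: str) -> int:
    """Simple lexical-overlap score between chunk and query."""
    return len(tokens(chunk) & tokens(query))

def retrieve(context: str, query: str, k: int = 2) -> list[str]:
    """Return the top-k chunks of the conversation itself."""
    chunks = chunk_text(context)
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]

conversation = (
    "Earlier the user mentioned feeling anxious about a job interview. "
    "Later the discussion drifted to weekend plans and the weather. "
    "Near the end the user said the anxiety came back when thinking about Monday."
)
top = retrieve(conversation, "Why is the user anxious?", k=1)
# The retrieved chunk(s) would then be prepended to the prompt for the LLM.
```

A real system would replace the lexical scorer with embedding similarity, but the design point survives even in this toy form: the retrieval source is the conversation context, so no external knowledge base is needed.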
Why it matters?
This work is important because it pushes LLMs closer to being able to handle emotional interactions in a more realistic and helpful way. Better emotional intelligence in AI could lead to more natural and effective chatbots, virtual assistants, and other applications that require understanding human feelings.
Abstract
Large language models (LLMs) have made significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conduct a comparative case study on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at https://github.com/LongEmotion/LongEmotion, and the project page can be found at https://longemotion.github.io/.