
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, Marco Del Tredici

2025-02-20

Summary

This paper introduces MemoryCode, a new way to test how well AI language models can work with people over long periods, especially on coding tasks. It's like checking whether a smart computer assistant can remember and follow instructions given across multiple conversations, not just one.

What's the problem?

AI language models are great at solving single problems, but they struggle when they need to remember information from multiple conversations over time. This is a big issue because in real work situations, we often need to collaborate with others (or AI) over many sessions, not just one. It's like having a super-smart friend who forgets everything you told them yesterday.

What's the solution?

The researchers created MemoryCode, a synthetic benchmark that simulates real-world coding scenarios spanning multiple conversations, with simple coding instructions mixed in among irrelevant information. They used it to evaluate a range of AI models, including very advanced ones like GPT-4o. They found that while all models could handle simple, isolated instructions well, even the best models struggled when instructions were spread out over multiple sessions. This pointed to the core problem: the models fail to retrieve and integrate information from earlier conversations.
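To make the setup concrete, here is a minimal sketch of the evaluation idea: instructions accumulate across sessions, filler chatter is interleaved, and a final coding task is scored by how many of the accumulated instructions the model's output actually follows. All names and data here are illustrative assumptions, not the actual MemoryCode dataset or scoring code.

```python
# Hypothetical sketch of a multi-session instruction-following check.
# The sessions, rules, and scoring below are illustrative only.

sessions = [
    ["Mentor: always start function names with 'my_'.",   # instruction
     "Mentor: the office party is on Friday."],           # irrelevant filler
    ["Mentor: also add a docstring to every function."],  # later instruction
    ["Mentor: lunch is moved to 1pm."],                   # filler-only session
]

# Instructions the model must remember and combine at test time.
active_rules = {
    "prefix": lambda code: code.lstrip().startswith("def my_"),
    "docstring": lambda code: '"""' in code,
}

def score(generated_code: str) -> float:
    """Fraction of accumulated instructions the generated code follows."""
    passed = sum(rule(generated_code) for rule in active_rules.values())
    return passed / len(active_rules)

# A model that integrated both sessions' instructions scores 1.0:
good = 'def my_add(a, b):\n    """Add two numbers."""\n    return a + b'
# A model that forgot the first session's rule scores only 0.5:
forgetful = 'def add(a, b):\n    """Add two numbers."""\n    return a + b'

print(score(good), score(forgetful))  # → 1.0 0.5
```

The key property this toy setup shares with the paper's design is that no single session contains everything needed: a model that only attends to the most recent conversation will miss earlier rules.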

Why it matters?

This matters because as we use AI more in our work, we need to know whether it can be a reliable long-term teammate. The study shows that current AI, even the most advanced, struggles with this kind of long-term collaboration, highlighting a major area for improvement. Solving this problem could lead to AI assistants that truly work alongside humans on complex, long-term projects, changing how we work with computers in fields like software development and research.

Abstract

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.