Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality
Suzhen Zhong, Ying Zou, Bram Adams
2025-09-19
Summary
This paper investigates how software developers actually use Large Language Models (LLMs) like ChatGPT when writing code, looking at real conversations between developers and these AI tools.
What's the problem?
While LLMs are becoming popular for coding help, we don't really know *how* developers are using them, what kinds of problems they're trying to solve, or if the code the LLMs generate is actually good quality. It's also unclear how back-and-forth conversations with the LLM affect the final result.
What's the solution?
Researchers analyzed CodeChat, a dataset of 82,845 real conversations developers had with an LLM, covering 368,506 code snippets in more than 20 programming languages. They found that LLM responses tend to be much longer than the prompts that trigger them, and that most tasks (68% of conversations) take multiple turns of conversation. They also identified common language-specific errors in the generated code: undefined variables in Python and JavaScript, missing comments in Java, missing header files in C++, and unresolved namespaces in C#. Importantly, they saw that code quality can improve over the course of a conversation, especially when developers specifically point out errors and ask the LLM to fix them.
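To make the "undefined variable" category concrete, here is a minimal sketch of how such an issue could be flagged statically in a Python snippet. This uses only the standard `ast` module; the `undefined_names` helper is hypothetical for illustration and is not the tooling the study actually used.

```python
import ast
import builtins

def undefined_names(source: str) -> set[str]:
    """Roughly flag names that are read but never bound in a snippet.

    Collects every name bound by assignments, imports, and def/class
    statements (plus function parameters), then reports loaded names
    that match neither a binding nor a builtin.
    """
    tree = ast.parse(source)
    bound, loaded = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
            if hasattr(node, "args"):  # treat parameters as bound
                for arg in node.args.args:
                    bound.add(arg.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
    return {n for n in loaded if n not in bound and not hasattr(builtins, n)}

# An LLM-style snippet that references `df` without ever defining it:
snippet = "result = df.groupby('lang').size()\nprint(result)"
print(undefined_names(snippet))  # → {'df'}
```

A check in this spirit catches the most common issue the study reports for Python and JavaScript code: snippets that silently assume a variable was defined earlier in the conversation.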
Why it matters?
Understanding how developers interact with LLMs is crucial for improving these tools and making them more effective. This research highlights specific areas where LLMs struggle, like generating complete and correct code in different languages, and shows that providing feedback is key to getting better results. This knowledge can help developers use LLMs more efficiently and help AI developers build better coding assistants.
Abstract
Large Language Models (LLMs) are becoming integral to modern software development workflows, assisting developers with code generation, API explanation, and iterative problem-solving through natural language conversations. Despite widespread adoption, there is limited understanding of how developers interact with LLMs in practice and how these conversational dynamics influence task outcomes, code quality, and software engineering workflows. To address this, we leverage CodeChat, a large dataset comprising 82,845 real-world developer-LLM conversations, containing 368,506 code snippets generated across over 20 programming languages, derived from the WildChat dataset. We find that LLM responses are substantially longer than developer prompts, with a median token-length ratio of 14:1. Multi-turn conversations account for 68% of the dataset and often evolve due to shifting requirements, incomplete prompts, or clarification requests. Topic analysis identifies web design (9.6% of conversations) and neural network training (8.7% of conversations) as the most frequent LLM-assisted tasks. Evaluation across five languages (i.e., Python, JavaScript, C++, Java, and C#) reveals prevalent and language-specific issues in LLM-generated code: generated Python and JavaScript code often includes undefined variables (83.4% and 75.3% of code snippets, respectively); Java code lacks required comments (75.9%); C++ code frequently omits headers (41.1%); and C# code shows unresolved namespaces (49.2%). During a conversation, syntax and import errors persist across turns; however, documentation quality in Java improves by up to 14.7%, and import handling in Python improves by 3.7% over 5 turns. Prompts that point out mistakes in code generated in prior turns and explicitly request a fix are most effective for resolving errors.
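The abstract's last finding, that naming the error and explicitly requesting a fix works best, can be sketched as a chat-messages structure. The exchange below is a hypothetical example (not taken from the CodeChat dataset), using the generic role/content message format common to chat APIs.

```python
# Hypothetical multi-turn exchange, structured as chat messages.
# The effective follow-up names the concrete error and asks for a fix,
# rather than vaguely retrying the original request.
messages = [
    {"role": "user",
     "content": "Write a Python function that parses a date string."},
    {"role": "assistant",
     "content": "def parse(s):\n    return datetime.strptime(s, '%Y-%m-%d')"},
    # Effective follow-up: point out the mistake, request the fix.
    {"role": "user",
     "content": "NameError: 'datetime' is not defined -- please add the "
                "missing import and return the corrected code."},
]
print([m["role"] for m in messages])  # → ['user', 'assistant', 'user']
```

The third message mirrors the prompt style the study found most effective for resolving errors: quote or describe the failure from the prior turn, then make an explicit repair request.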