- The paper presents extensive empirical analysis on 82,845 developer-LLM conversations, revealing common prompt design gaps and iterative interaction dynamics.
- The study employs comprehensive metrics and static analysis tools across five programming languages to quantify code quality and identify language-specific defects.
- Findings emphasize the importance of iterative prompt refinement and post-generation verification to mitigate syntax and documentation errors in LLM-generated code.
Empirical Analysis of Developer-LLM Conversations and Code Quality
Introduction
This paper presents a comprehensive empirical study of developer interactions with LLMs in real-world software engineering contexts, focusing on conversational dynamics and the quality of generated code. Leveraging the CodeChat dataset—comprising 82,845 developer-LLM conversations and 368,506 code snippets across more than 20 programming languages—the study systematically characterizes conversation structures, topical trends, and code quality issues. The analysis provides quantitative insights into how developers engage with LLMs, the nature of their requests, and the strengths and limitations of LLM-generated code.
Dataset Construction and Methodology
The CodeChat dataset is derived from the WildChat corpus, which aggregates public ChatGPT interactions. Filtering for code-centric exchanges yields a large-scale, naturalistic dataset of developer-LLM conversations. The paper defines and applies a suite of conversation-level metrics, including token ratio, turn count, prompt design gap frequency, programming language rate, lines of code, and multi-language co-occurrence rate. Topic modeling is performed using BERTopic on 52,086 English prompts to identify developer intent clusters. Code quality is assessed using static analysis tools (Pylint, ESLint, Cppcheck, PMD, Roslyn) across Python, JavaScript, C++, Java, and C#.
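The topic-modeling step can be sketched with BERTopic's standard fit_transform API. The snippet below is a minimal illustration that assumes the 52,086 English prompts are already available as a list of strings; the loading code and the min_topic_size value are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of the topic-modeling step; parameters are illustrative, not the paper's.
from bertopic import BERTopic

def cluster_prompts(english_prompts: list[str]):
    """Cluster developer prompts into intent topics with BERTopic."""
    topic_model = BERTopic(language="english", min_topic_size=50)  # assumed settings
    topics, _probs = topic_model.fit_transform(english_prompts)
    # Topic -1 is BERTopic's outlier bucket; the rest are ranked by cluster size.
    print(topic_model.get_topic_info().head(10))
    return topic_model, topics
```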
Conversational Dynamics and Language Distribution
LLM responses are substantially more verbose than developer prompts, with a median token-length ratio of 14:1 and an average response length of 2,000 characters—2.4 times longer than Stack Overflow answers. Multi-turn conversations constitute 68% of the dataset, primarily driven by shifting requirements, incomplete prompts, and clarification requests. The most frequent prompt design gaps are "Different Use Cases" (37.8%), "Missing Specifications" (14.8%), and "Additional Functionality" (12.8%), indicating that iterative refinement and prompt ambiguity are prevalent.
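To make the conversation-level metrics concrete, here is a minimal sketch of how the token-length ratio and turn count could be computed. The message schema and whitespace tokenization are assumptions standing in for the paper's actual data format and tokenizer.

```python
# Illustrative conversation-level metrics; the schema and tokenization are assumed.
def conversation_metrics(conversation: list[dict]) -> dict:
    """conversation: list of {"role": "user" | "assistant", "text": str} messages."""
    user_tokens = sum(len(m["text"].split()) for m in conversation if m["role"] == "user")
    llm_tokens = sum(len(m["text"].split()) for m in conversation if m["role"] == "assistant")
    turns = sum(1 for m in conversation if m["role"] == "user")  # one turn per user prompt
    return {
        "token_ratio": llm_tokens / user_tokens if user_tokens else None,
        "turn_count": turns,
        "multi_turn": turns > 1,
    }
```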
LLMs generate code in over 20 languages, with Python (31%) and JavaScript (9%) being most common. The distribution diverges from real-world usage for JavaScript and Bash, but aligns for Python and C++. Code snippets are generally concise (median <30 LOC), with C and HTML being the most verbose. Multi-language code generation is frequent, especially for web development (CSS-HTML, HTML-JavaScript).
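A multi-language co-occurrence rate of the kind reported here could be tallied as follows; the per-conversation lists of language tags are an assumed input format, not CodeChat's actual schema.

```python
# Sketch of multi-language co-occurrence: count how often two languages appear
# in the same conversation. The language-tag input format is an assumption.
from collections import Counter
from itertools import combinations

def language_cooccurrence(conversations: list[list[str]]) -> Counter:
    """conversations: per-conversation lists of code-fence language tags."""
    pairs = Counter()
    for langs in conversations:
        for a, b in combinations(sorted(set(langs)), 2):
            pairs[(a, b)] += 1
    return pairs

# e.g. language_cooccurrence([["html", "css", "javascript"], ["python"]])
# -> Counter({('css', 'html'): 1, ('css', 'javascript'): 1, ('html', 'javascript'): 1})
```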
Topical Trends in Developer Prompts
Topic modeling reveals that web design (9.6%) and machine learning model training (8.7%) are the most common developer-LLM interaction topics. Web-related tasks predominantly use HTML and JavaScript, while machine learning tasks are Python-centric. Other notable topics include low-level programming (C/C++), binary patching, and business tool automation. Engagement levels, measured by turn count, vary significantly by topic, with AI-augmented business automation showing the highest interaction (mean 3.40 turns). Extended conversations are often characterized by repeated shifts in use cases and alternation between feature requests and error correction.
Code Quality Assessment
Static analysis of 63,685 code snippets across five languages uncovers widespread and language-specific defects (a Pylint-based tally sketch follows the list):
- Python: Invalid naming (83.4%), undefined variables (30.8%), import errors (20.8%), and missing documentation (32.1%). Undefined variable errors increase across turns, while import errors decrease.
- JavaScript: Undefined variables (75.3%), unused variables (33.7%), and syntax errors (14.4%). Syntax errors are more frequent than in human-written code.
- C++: Missing headers (41.1%), unused functions (17.8%), and syntax errors (9.4%). No significant improvement across turns.
- Java: Missing required comments (75.9%), missing final modifiers on local variables (45.3%), and documentation violations. Documentation quality improves by 14.7% over five turns.
- C#: Unresolved namespaces (49.2%), missing documentation (42.4%), and accessibility omissions (10.0%).
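As referenced above, here is a hedged sketch of the Python-side check: run Pylint in JSON mode on a generated snippet and tally message symbols such as undefined-variable or missing-function-docstring. The file path and the mapping from Pylint symbols to the paper's defect categories are assumptions.

```python
# Sketch: run Pylint on one generated snippet and count message symbols.
import json
import subprocess

def pylint_issue_counts(snippet_path: str) -> dict:
    """Return {pylint symbol: count} for one file; Pylint exits non-zero when it finds issues."""
    result = subprocess.run(
        ["pylint", "--output-format=json", snippet_path],
        capture_output=True, text=True, check=False,
    )
    messages = json.loads(result.stdout or "[]")
    counts: dict[str, int] = {}
    for msg in messages:
        counts[msg["symbol"]] = counts.get(msg["symbol"], 0) + 1
    return counts

# e.g. pylint_issue_counts("generated_snippet.py")  # path is hypothetical
# -> {"invalid-name": 3, "undefined-variable": 1, "missing-module-docstring": 1}
```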
Prompts that explicitly point out mistakes and request fixes are most effective for resolving syntax errors (22.8%), followed by guided questions (16.9%) and specific instructions (16.5%). These strategies are associated with successful error correction and reduced turn count.
Implications
For Developers
LLM-generated code requires rigorous post-response verification, including static analysis and iterative prompt refinement. The high prevalence of syntax, structural, and maintainability issues necessitates careful review before integration into production workflows.
For Conversational Code Assistants
Artifact management and automated post-generation workflows should be prioritized: assistants need multi-language context management, dependency tracking, version control, and automated code checking to improve usability and reliability. IDE integration should support intuitive code insertion, cross-file linking, and context-aware prompting so that iterative development stays grounded in the surrounding project.
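As one illustration of the artifact-management point (a hypothetical sketch, not a design proposed in the paper), an assistant could keep each generated snippet as a named, versioned artifact so that regenerations do not silently overwrite earlier working code:

```python
# Hypothetical artifact store for a conversational assistant: each generated
# snippet is kept as a versioned artifact keyed by name and language.
from dataclasses import dataclass, field

@dataclass
class CodeArtifact:
    name: str
    language: str
    versions: list[str] = field(default_factory=list)  # one entry per regeneration

    def update(self, code: str) -> int:
        """Record a new version and return its index."""
        self.versions.append(code)
        return len(self.versions) - 1

    def latest(self) -> str:
        return self.versions[-1] if self.versions else ""

store: dict[str, CodeArtifact] = {}

def record_generation(name: str, language: str, code: str) -> int:
    artifact = store.setdefault(name, CodeArtifact(name, language))
    return artifact.update(code)
```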
For Researchers
Optimizing token allocation and code tokenization is critical to balance accuracy and cost. New benchmarks reflecting real-world developer tasks (e.g., web design, ML model training) are needed. Improving LLM-generated documentation and maintainability remains an open challenge.
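A first step toward studying token allocation is measuring how response tokens split between prose and code. The sketch below uses tiktoken's cl100k_base encoding and a naive code-fence regex, both of which are assumptions rather than the paper's method.

```python
# Sketch: split a response into code blocks vs. prose and count tokens in each.
import re
import tiktoken

FENCE = "`" * 3  # Markdown code-fence delimiter
BLOCK = re.compile(FENCE + r".*?\n(.*?)" + FENCE, re.DOTALL)

def token_split(response: str) -> dict:
    """Count tokens inside fenced code blocks vs. the surrounding prose."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding choice
    code = "\n".join(BLOCK.findall(response))
    prose = BLOCK.sub("", response)
    return {"code_tokens": len(enc.encode(code)), "prose_tokens": len(enc.encode(prose))}
```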
Conclusion
This paper provides a detailed empirical characterization of developer-LLM interactions and the quality of generated code in practical software engineering scenarios. The findings highlight the prevalence of iterative, multi-turn conversations, frequent code defects, and the need for improved prompt engineering, artifact management, and automated verification. Future work should focus on domain-specific benchmarks and advanced error-correction techniques to enhance the trustworthiness and utility of LLM-generated code.