Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Quality Assessment of Python Tests Generated by Large Language Models (2506.14297v1)

Published 17 Jun 2025 in cs.SE

Abstract: The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions. LLMs have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently. This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and LLama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C). Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns. Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell. Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion of Test Cases was the most frequently detected (41%). Prompt context significantly influenced test quality; textual prompts with detailed instructions often yielded tests with fewer errors but a higher incidence of test smells. Among the evaluated LLMs, GPT-4o produced the fewest errors in both contexts (10% in C2C and 6% in T2C), whereas Amazon Q had the highest error rates (19% in C2C and 28% in T2C). For test smells, Amazon Q had fewer detections in the C2C context (9%), while LLama 3.3 performed best in the T2C context (10%). Additionally, we observed a strong relationship between specific errors, such as assertion or indentation issues, and test case cohesion smells. These findings demonstrate opportunities for improving the quality of test generation by LLMs and highlight the need for future research to explore optimized generation scenarios and better prompt engineering strategies.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Victor Alves (26 papers)
  2. Carla Bezerra (2 papers)
  3. Ivan Machado (12 papers)
  4. Larissa Rocha (6 papers)
  5. Tássio Virgínio (2 papers)
  6. Publio Silva (1 paper)
X Twitter Logo Streamline Icon: https://streamlinehq.com