Automatic High-Level Test Case Generation using Large Language Models (2503.17998v1)

Published 23 Mar 2025 in cs.SE

Abstract: We explored the challenges practitioners face in software testing and proposed automated solutions to address these obstacles. We began with a survey of local software companies and 26 practitioners, revealing that the primary challenge is not writing test scripts but aligning testing efforts with business requirements. Based on these insights, we constructed a use-case $\rightarrow$ (high-level) test-cases dataset to train/fine-tune models for generating high-level test cases. High-level test cases specify what aspects of the software's functionality need to be tested, along with the expected outcomes. We evaluated LLMs, such as GPT-4o, Gemini, LLaMA 3.1 8B, and Mistral 7B, where fine-tuning (the latter two) yields improved performance. A final (human evaluation) survey confirmed the effectiveness of these generated test cases. Our proactive approach strengthens requirement-testing alignment and facilitates early test case generation to streamline development.

Summary

  • The paper shows that fine-tuned 10B-scale LLMs can automatically generate high-level test cases, bridging the gap between requirements and testing.
  • It details a four-phase methodology including practitioner surveys, dataset construction, LLM evaluation, and qualitative human assessment.
  • Quantitative metrics and human evaluations confirm that tailored LLMs enhance relevance and completeness for in-house, confidential test design.

Automatic High-Level Test Case Generation using LLMs

Introduction and Motivation

Automated software testing remains a significant bottleneck in aligning development with business requirements, particularly due to persistent communication gaps between requirement engineers and testers. The paper addresses a core pain point identified via industrial surveys: testers struggle more with identifying what to test than with the mechanics of test script development. High-level test cases—abstract specifications describing which functionalities, behaviors, and edge scenarios require validation—emerge as pivotal artifacts for improving requirements-test alignment and promoting early, business-driven validation.

Figure 1: Illustration of mapping a use case such as "Add Item in Shopping Cart" into corresponding high-level test cases encompassing positive, negative, and edge-case scenarios.

Methodology and Dataset Construction

The paper proceeds through four phases:

  1. Practitioner survey to confirm and characterize the requirement-testing gap;
  2. Creation and curation of a use-case to high-level test case dataset (n=1067), designed for both training and evaluation;
  3. Assessment of LLM architectures—pre-trained and fine-tuned—for test case generation;
  4. Human evaluation for qualitative benchmarking.

Figure 2: Workflow depicting the four phases: survey, dataset construction, LLM evaluation, and human validation.

Core to this effort is a curated dataset spanning 168 real-world and student-driven software projects. Test case quality was ensured through a combination of faculty supervision and manual validation, particularly for cases synthesized from tools like UiPath and GPT-4o. The diversity across domains (e-commerce, finance, healthcare, etc.) ensures coverage of real-world variability.
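
To make the dataset format concrete, the sketch below shows what a single use-case-to-test-cases record could look like. The field names and JSONL layout are assumptions for illustration; the paper does not publish its exact schema.

    # Illustrative record for a use-case -> high-level test-cases dataset.
    # Field names and structure are assumed, not taken from the paper.
    import json

    record = {
        "project": "E-commerce Web Shop",  # hypothetical project name
        "use_case": "Add Item in Shopping Cart",
        "high_level_test_cases": [
            {"type": "positive", "description": "Verify a valid item is added and the cart total updates."},
            {"type": "negative", "description": "Verify adding an out-of-stock item is rejected with a clear message."},
            {"type": "edge", "description": "Verify adding the maximum allowed quantity of an item."},
        ],
    }

    # One record per line (JSONL) keeps the dataset easy to stream during training.
    print(json.dumps(record))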

LLM-driven High-Level Test Case Generation

Prompt Engineering and LLM Selection

The paper rigorously analyzes the efficacy of multiple LLMs: GPT-4o, Gemini, LLaMA 3.1 8B, and Mistral 7B. A one-shot prompt engineering template is used, in which a "canonical" example of a use case with associated test cases is provided to the LLM, followed by a target use case for which test cases are requested.

Figure 3: Example of a one-shot prompt engineering strategy for high-level test case generation from a use case.
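
A minimal sketch of how such a one-shot prompt could be assembled is shown below. The instruction wording and the worked example are assumptions; the paper's exact prompt template is not reproduced here.

    # Sketch of a one-shot prompt for high-level test case generation.
    # The instruction text and example are illustrative, not the paper's template.
    EXAMPLE_USE_CASE = "Add Item in Shopping Cart"
    EXAMPLE_TEST_CASES = (
        "1. Verify a valid item is added and the cart total updates. (positive)\n"
        "2. Verify adding an out-of-stock item is rejected. (negative)\n"
        "3. Verify adding the maximum allowed quantity of an item. (edge)"
    )

    def build_one_shot_prompt(target_use_case: str) -> str:
        """Return a prompt with one worked example followed by the target use case."""
        return (
            "You are a software tester. Given a use case, write high-level test cases "
            "covering positive, negative, and edge scenarios.\n\n"
            f"Use case: {EXAMPLE_USE_CASE}\nTest cases:\n{EXAMPLE_TEST_CASES}\n\n"
            f"Use case: {target_use_case}\nTest cases:\n"
        )

    print(build_one_shot_prompt("Checkout with a Saved Credit Card"))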

Generation parameters (e.g., temperature, top-p) are controlled to keep model output deterministic. For local deployability, the LLaMA and Mistral models are used in both their pre-trained and QLoRA-fine-tuned forms, directly addressing the industrial requirement for data privacy.
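
The snippet below sketches how a QLoRA setup for one of the local models could look with the Hugging Face transformers and peft libraries. The checkpoint name, LoRA hyperparameters, and generation settings are assumptions for illustration; the paper does not list its exact configuration.

    # Minimal QLoRA setup sketch (checkpoint name and hyperparameters are assumed).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit base weights (the "Q" in QLoRA)
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Near-deterministic generation: greedy decoding instead of sampling.
    inputs = tokenizer("Use case: Add Item in Shopping Cart\nTest cases:\n",
                       return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))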

Quantitative Performance and Trade-offs

Performance is evaluated using BERTScore (RoBERTa-based), which prioritizes semantic equivalence over surface-form overlap. This is critical when alternative wording is acceptable but semantic coverage must be preserved.
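
For reference, a BERTScore comparison of a generated test case against a reference one can be computed with the bert-score package roughly as follows; the example strings and configuration are illustrative, not the paper's evaluation code.

    # Sketch of a BERTScore comparison (bert-score package; defaults assumed).
    from bert_score import score

    candidates = ["Verify that adding an item to the cart updates the cart total."]
    references = ["Check that an item added to the shopping cart appears with the correct total."]

    # With lang="en", bert-score defaults to a RoBERTa-large backbone.
    P, R, F1 = score(candidates, references, lang="en")
    print(f"BERTScore F1: {F1.mean().item():.4f}")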

Results:

  • GPT-4o: BERTScore F1 = 88.43%
  • Gemini: BERTScore F1 = 87.92%
  • LLaMA 3.1 8B: BERTScore F1 = 79.85% pre-trained, 89.94% after fine-tuning
  • Mistral 7B: BERTScore F1 = 82.86% pre-trained, 90.14% after fine-tuning

Fine-tuned LLaMA 3.1 8B and Mistral 7B outperform much larger pre-trained models such as GPT-4o and Gemini while remaining at the 10B-parameter scale. This is a substantial result for private deployment scenarios.

Human Evaluation: Quality, Failure Modes, and Coverage

A controlled survey with 26 industry and academic experts (double-blind, randomized order) assessed readability, correctness, completeness, relevance, and usability.

  • LLM-generated test cases (GPT-4o, LLaMA 3.1 8B) scored above 3.5/5 in all dimensions; human-authored cases scored highest overall, although LLMs sometimes exceeded humans in readability.
  • Completeness—especially regarding edge/negative cases—remains a challenge for both AI and human authors.
  • Notable failure modes in LLM-generated tests:
    • Hallucination: Irrelevant scenario inclusion.
    • Redundancy: Duplicate case rephrasings.
    • Incompleteness: Missed edge validations (e.g., missing input constraint checks).

A trade-off was observed: LLaMA's fine-tuning led to more focused (relevant but potentially incomplete) cases, whereas GPT-4o tended to favor completeness at the expense of relevance.

Contextual Enhancements and RAG

The paper explored two augmentation strategies:

  • Contextual prompting: Adding project/module summaries to prompts produced negligible gains, indicating that well-written use cases already encode most of the needed context.
  • Retrieval Augmented Generation (RAG): Supplying more relevant prior examples as prompt context had only marginal effects due to heterogeneity across projects. Intra-project in-context learning could mitigate this limitation; a minimal retrieval sketch follows this list.
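
The sketch below illustrates the retrieval step, assuming a sentence-embedding model and cosine similarity over prior use cases; the retrieval pipeline actually used in the paper may differ.

    # Sketch of retrieving the most similar prior use case for in-context prompting.
    # Embedding model and similarity metric are assumptions, not the paper's setup.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    def retrieve_example(target_use_case: str, pool: list[dict]) -> dict:
        """Return the pool record whose use case is most similar to the target."""
        corpus_emb = embedder.encode([r["use_case"] for r in pool], convert_to_tensor=True)
        query_emb = embedder.encode(target_use_case, convert_to_tensor=True)
        best = util.cos_sim(query_emb, corpus_emb).argmax().item()
        return pool[best]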

Limitations, Threats to Validity, and Practical Deployment

The experimental setup is comprehensive: an 80/20 train-test split is combined with project-wise partitioning to enforce strict evaluation on unseen domains. Dataset quality assurance and double-rater human evaluation with a weighted Cohen's kappa of 0.42 (moderate agreement) are reported.
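
A project-wise 80/20 split of this kind can be reproduced with a grouped splitter, as sketched below; the record schema and random seed are assumptions, not the paper's code.

    # Sketch of a project-wise train/test split so no project appears in both splits.
    from sklearn.model_selection import GroupShuffleSplit

    def project_wise_split(records, test_size=0.2, seed=42):
        """records: list of dicts with a 'project' key (assumed schema)."""
        groups = [r["project"] for r in records]
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        train_idx, test_idx = next(splitter.split(records, groups=groups))
        return [records[i] for i in train_idx], [records[i] for i in test_idx]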

Key practical implications:

  • Open-source, 10B-scale LLMs—fine-tuned on domain use-case data—are a viable enterprise option for in-house, confidential AI-powered test design.
  • The LLM approach is well-suited for test-driven and requirements-driven development workflows; automated high-level case generation can precede code, thus supporting early bug/requirement misalignment detection.
  • Automated high-level test generation is a significant time saver and risk mitigator, especially for organizations lacking experienced testers or with rapidly changing requirement sets.

Implications for AI and SE Practice

This work demonstrates that LLM-based test case synthesis is feasible and scalable for high-level test cases, abstracting from the level of code to business logic validation. The robustness of LLMs in generating semantically rich scenario validations, including positive, negative, and edge cases, means that their deployment can help bridge the requirements-test gap.

The results underscore the necessity of fine-tuning even performant pre-trained models for tightly constrained domains or privacy-centric deployments. The transition from generic SaaS access (e.g., GPT-4o) to private, locally deployed models reduces risk and operational cost.

The primary bottleneck, as the results indicate, is not LLM capacity but coverage, in particular the identification of negative and edge cases. Advances in curriculum learning and in-domain, edge-case-focused training may further close this gap.

Future Research Directions

  • Targeted curriculum learning to incentivize edge and negative case identification.
  • Domain-adaptive fine-tuning for regulated sectors (e.g., healthcare, finance) to further boost relevance and coverage.
  • Deeper integration with software development pipelines for real-time use-case-to-test-case generation.
  • Systematic exploration of prompt- and RAG-based techniques in intra-project settings for higher fidelity in scenario mapping.

Conclusion

This paper establishes automated high-level test case generation using LLMs as a robust, practical aid for requirement-driven software testing. The introduction of a substantial, validated benchmark dataset, paired with empirical evidence showing fine-tuned mid-scale LLMs can equal or exceed larger pre-trained models, presents a compelling path forward for confidential, automated, and scalable quality assurance workflows in software engineering.
