
An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models

Published 3 Apr 2026 in cs.SE | (2604.02754v1)

Abstract: The increasing use of LLMs in automated test script generation raises concerns about their environmental impact, yet existing sustainability analyses focus predominantly on LLMs. As a result, the energy and carbon characteristics of small language models (SLMs) during prompt-driven unit-test script generation remain largely unexplored. To address this gap, this study empirically examines the environmental and performance trade-offs of SLMs (in the 2B-8B parameter range) using the HumanEval benchmark and adaptive prompt variants (based on the Anthropic template). The analysis uses CodeCarbon to characterize energy consumption, carbon emissions, and duration under controlled conditions, with unit-test script coverage serving as an initial proxy for generated test quality. Our results show that different SLMs exhibit distinct sustainability profiles - some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions. Overall, this work provides focused empirical evidence on sustainable SLM-based test script generation, clarifying how prompt structure and model selection jointly shape environmental and performance outcomes.

Authors (2)

Summary

  • The paper introduces novel composite sustainability metrics (SVI and GFβ) to evaluate test script generation performance and eco-efficiency.
  • It employs a rigorous five-phase methodology combining adaptive prompt engineering and quantization regimes with regional carbon intensity analysis.
  • Empirical results demonstrate that model performance and environmental impact depend on the interplay between quantization depth, prompt design, and regional energy profiles.

Sustainability-Performance Trade-Offs in Prompt-Driven Test Script Generation with Small Language Models

Introduction

The empirical study "An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models" (2604.02754) advances the understanding of eco-efficiency in AI-driven software engineering by foregrounding sustainability metrics in prompt-driven automated test script generation with small language models (SLMs; 2B–8B parameters). Most prior research in this area has concentrated on LLMs and neglected nuanced exploration of the performance–environmental cost trade-offs for SLMs operating in real production environments—especially in the context of varying prompt designs, quantization regimes, and regional carbon intensities.

Experimental Pipeline and Methodological Rigor

The analytical pipeline (Figure 1) operationalizes a five-phase workflow: dataset preprocessing (leveraging the HumanEval benchmark); development of adaptive prompt templates (APV0–APV3, based on Anthropic prompt engineering principles); parameterized inference with state-of-the-art SLMs (Phi-3.5-mini, Qwen2.5-1.5B, deepseek-coder-7b, Mistral-7B, Llama-3-8B) across quantization schemes; granular trace-based sustainability metric logging (with CodeCarbon, regional grid-intensity tracking, and functional code coverage via Coverage.py); and systematic multidimensional trade-off analysis.

Figure 1: Core empirical pipeline capturing preprocessing, prompt engineering, test script generation under varying SLM/prompt/quantization, and sustainability trade-off computation.

Evaluation Metrics and Novel Sustainability Indices

The study introduces and operationalizes two composite sustainability metrics, both targeted at the inferential phase of SLM-driven code generation under zero-shot prompting:

  • Sustainability Velocity Index (SVI): A normalized, multiplicative metric integrating code coverage, software carbon intensity (SCI), execution duration, and run-to-run stability. SVI's construction enforces that weaknesses in any axis (e.g., high emissions, slow inference, low coverage, or instability) are not masked by strengths elsewhere.
  • Green Fβ Score (GFβ): A tunable composite of eco-efficiency (ECO) and accuracy (coverage), with the trade-off parameter β determining the emphasis on ecological versus functional priorities.

These indices depart from established single-dimensional efficiency benchmarks and enable explicit ranking and prioritization, supporting both technical and policy-driven model selection.
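The summary describes both indices only qualitatively, not with exact formulas. As a minimal sketch under those descriptions, SVI can be modeled as a geometric mean of four axes normalized to [0, 1] (so any weak axis pulls the score down multiplicatively), and GFβ as an Fβ-style blend of eco-efficiency and coverage. The normalization bounds `sci_max` and `duration_max` are illustrative assumptions, not values from the paper:

```python
def svi(coverage, sci, duration, stability,
        sci_max=1.0, duration_max=60.0):
    """Sketch of a multiplicative sustainability index.

    Each axis is normalized to [0, 1]; the geometric mean ensures
    a weakness on any single axis (high emissions, slow inference,
    low coverage, instability) cannot be masked by the others.
    """
    cov_n = max(0.0, min(coverage, 1.0))
    sci_n = max(0.0, 1.0 - sci / sci_max)            # lower emissions -> higher
    dur_n = max(0.0, 1.0 - duration / duration_max)  # faster runs -> higher
    stab_n = max(0.0, min(stability, 1.0))           # run-to-run stability
    return (cov_n * sci_n * dur_n * stab_n) ** 0.25  # geometric mean of 4 axes


def gf_beta(eco, coverage, beta=1.0):
    """F_beta-style blend of eco-efficiency (ECO) and coverage.

    By analogy with F_beta, beta > 1 shifts the emphasis toward the
    ecological axis, beta < 1 toward functional coverage.
    """
    if eco == 0.0 and coverage == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * eco * coverage / (b2 * coverage + eco)
```

At β = 1, GFβ reduces to the harmonic mean of ECO and coverage, mirroring the standard F1 score; the multiplicative structure of SVI is what enforces the "no axis can be masked" property described above.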

Empirical Results: Model, Prompt, and Region Sensitivities

Software Carbon Intensity (SCI) and Regional Effects

SCI results (Figure 2) expose the dominant impact of regional grid carbon intensity, with lower SCI for inferences performed in regions with cleaner grids (e.g., Netherlands, USA–Oregon) and elevated SCI in higher-intensity grids (e.g., Singapore, USA–Nevada). When controlling for region, prompt structure and model architecture are the primary determinants of energy and emissions overhead per generated script.

Figure 2: SCI evaluation across SLMs, prompts, and regions, highlighting the interaction between workload migration and grid carbon intensity.
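The mechanism behind the regional effect can be sketched in a few lines: emissions scale as measured inference energy times the grid's carbon intensity, amortized over the scripts produced. The intensity values below are illustrative placeholders, not figures from the paper:

```python
# Illustrative grid carbon intensities (gCO2e per kWh).
# Placeholder values for demonstration only, not data from the study.
GRID_INTENSITY = {
    "netherlands": 250.0,
    "usa-oregon": 180.0,
    "singapore": 470.0,
    "usa-nevada": 400.0,
}


def sci_per_script(energy_kwh, region, n_scripts):
    """Emissions attributable to one generated test script:
    measured inference energy times regional grid intensity,
    divided by the number of scripts produced."""
    grams_co2e = energy_kwh * GRID_INTENSITY[region]
    return grams_co2e / n_scripts
```

With energy and script count held fixed, the same workload migrated from Singapore to Oregon cuts per-script emissions in proportion to the intensity ratio, which is why region dominates before prompt and model effects become visible.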

Holistic Sustainability Ranking: SVI

SVI analysis (Figure 3) demonstrates that region and prompt engineering interact to determine composite eco-performance. While models targeting low-carbon regions trend above-median SVI, prompt and model configurations can counteract or exacerbate regional differences. The results validate SVI as a multidimensional, discriminative scoring function for test-automation scenario selection.

Figure 3: SVI as a holistic index, showing performance and sustainability trade-offs across prompts, models, and regional infrastructure.

Quantization Trade-Offs

Figure 4 presents the sensitivity of SVI to quantization depth for Phi-3.5-mini and Qwen2.5-1.5B under APV3. Higher-precision variants (unquantized, 8-bit) achieve superior SVI in fixed regions (e.g., Singapore for Phi-3.5-mini), implying that in these settings, coverage and stability gains from precision outweigh energy savings from aggressive quantization. However, SVI is critically modulated by region; for Qwen2.5-1.5B, 8-bit deployment in the lowest-carbon grid achieves the highest SVI, even compared to unquantized runs in higher-intensity grids.

Figure 4: Quantization effects on SVI, illustrating model- and region-dependent sustainability–performance dynamics.

Coverage vs. Eco-Efficiency Prioritization

GFβ-based rankings show that model ordering is non-monotonic with respect to β and deployment context. SLMs optimized for maximal code coverage do not necessarily yield best-in-class sustainability; conversely, the most eco-efficient models may underperform on strict coverage requirements if not calibrated. This property enables explicit architectural negotiation: deployment pipelines can be tuned to reflect priorities (coverage vs. ecological footprint) simply by adjusting β and controlling for anticipated regional workload placement.
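The ranking flip is easy to reproduce with two hypothetical model profiles and an Fβ-style GFβ (the exact formula is an assumption here, modeled on the standard Fβ score with β > 1 emphasizing eco-efficiency); the profiles and scores below are illustrative, not measurements from the paper:

```python
def gf_beta(eco, coverage, beta):
    """F_beta-style composite; beta > 1 weights eco-efficiency,
    beta < 1 weights coverage (assumed form, not the paper's)."""
    b2 = beta * beta
    if b2 * coverage + eco == 0.0:
        return 0.0
    return (1 + b2) * eco * coverage / (b2 * coverage + eco)


# Two hypothetical SLM profiles with opposite strengths.
models = {
    "coverage-first": {"eco": 0.5, "cov": 0.9},
    "eco-first": {"eco": 0.9, "cov": 0.5},
}


def ranking(beta):
    """Model names ordered by GF_beta, best first."""
    return sorted(
        models,
        key=lambda m: gf_beta(models[m]["eco"], models[m]["cov"], beta),
        reverse=True,
    )


print(ranking(0.5))  # coverage-weighted: coverage-first model leads
print(ranking(2.0))  # eco-weighted: eco-first model leads
```

The same pair of models swaps rank as β moves from 0.5 to 2.0, which is exactly the non-monotonic ordering the study reports: the "best" model is a function of the declared priority, not a fixed property.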

Implications and Future Prospects

This study rigorously demonstrates that sustainable SLM inference for automated test generation is driven not by model size alone but chiefly by the triadic interplay of model quantization, prompt design, and the carbon intensity of the execution environment. Strong empirical evidence is provided for a multidimensional ranking and selection approach, rather than simplistic parameter- or metric-based model filtering. For practitioners, the SVI and GFβ indices support rational, priority-aware choice of SLM deployments in resource-constrained, sustainability-sensitive settings (including edge, privacy-sensitive domains, or private clouds).

Theoretically, the results challenge the assumption that energy/resource trade-offs are linear in model parameterization or quantization; future work could extend these findings to alternate hardware (e.g., TPUs, ARM), cloud substrate, software regression testing, and compositional prompt workflows.

Conclusion

The empirical evaluation of prompt-driven SLM-based test script generation in (2604.02754) substantiates that neither model size nor emission metrics alone suffice for credible sustainability evaluation in automated software testing workflows. Only by analyzing the dynamic interactions among model, prompt, quantization, and regional deployment scenarios can software engineers, sustainability officers, and cloud architects meaningfully balance functional Quality-of-Service guarantees and sustainable operational footprints. The composite metrics developed in the study set a new methodological baseline for further sustainable AI engineering research.
