Papers
Topics
Authors
Recent
Search
2000 character limit reached

Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Published 30 May 2026 in cs.SE | (2606.00530v1)

Abstract: Testing is a core activity in software development workflows, and research on its automation has spanned several decades. Most existing approaches generate unit tests for individual methods, validate isolated API endpoints, or target user interface (UI) layers, with non-API and non-UI automated test generators typically exercising only a single focal method. Recent empirical evidence shows a substantial gap between such generated tests and developer-written ones, which often span multiple focal methods, involve complex call sequences, and contain elaborate assertions that current automated approaches fail to capture. To address this gap, we propose generating tests from natural language (NL) descriptions of developer intent. We present Sakura, the first agent-based framework for generating structurally complex test cases from NL descriptions. Sakura decomposes NL descriptions into structured blocks and processes them using a multi-agent system consisting of a localization agent that grounds test steps in concrete application code via static analysis, a composition agent that synthesizes compilable test code and iteratively refines it using execution feedback, and a supervisor agent that coordinates agent interactions. To evaluate Sakura, we curate a novel dataset of NL test descriptions at three levels of abstraction, systematically generated from developer-written tests mined from Apache Commons projects. Across 20 applications and 1,464 test scenarios, Sakura significantly outperforms off-the-shelf agentic tools such as Gemini CLI. Specifically, Sakura achieves 50-78% higher test compilability and 38-66% higher coverage overlap with ground-truth tests compared to baselines using the same models. Moreover, Sakura paired with small open-source models such as Devstral Small 2 and Qwen3-Coder outperforms Gemini CLI using large proprietary models, while also being more cost-effective.

Summary

  • The paper presents a multi-agent NL-to-test generation framework that decomposes natural language descriptions into BDD-style blocks for structurally complex testing.
  • It uses specialized agents for grounding, code synthesis, and iterative validation, achieving 50–78% higher compilability and 38–66% improved coverage over baselines.
  • The approach is scalable, cost-effective, and extensible to other statically-typed languages, marking a significant advancement in automated test generation.

Sakura: Multi-Agent NL-to-Test Generation for Structurally Complex Scenarios

Motivation and Problem Statement

Automated test generation has historically focused on unit tests targeting individual methods or classes, and API/UI-centric test suites, leaving a substantial gap relative to the structural complexity of developer-written tests in real-world codebases. Empirical evidence reveals that manually constructed tests frequently involve multi-class/multi-method interactions, complex setup/teardown procedures, and dense assertion logic, which most ATG tools—including conventional LLM-driven agents—fail to capture (Pan et al., 30 Sep 2025). The mismatch is evident in the stark divergence between generated and developer-written tests with respect to focal method coverage, behavioral fidelity, and assertion naturalness.

Initiating test generation from natural language (NL) descriptions of developer intent, rather than code-centric heuristics, offers a path toward bridging this gap. However, standard LLM-based agentic tools (e.g., Gemini CLI) lack structural grounding mechanisms, often producing semantically plausible but syntactically invalid or behaviorally incorrect test cases. Addressing this, Sakura introduces a multi-agent system with coordinated decomposition, grounding, code synthesis, and iterative validation, thereby enabling the generation of structurally complex tests directly from NL input.

Framework Architecture

Sakura operationalizes NL-driven test generation through a pipeline consisting of: code index creation, NL decomposition, and an agentic workflow combining localization, composition, and supervisory coordination. Figure 1

Figure 1: The Sakura framework converts NL test descriptions into BDD-style blocks and synthesizes executable, structurally complex test code via specialized agents.

Code Indexing

A onetime vectorization of the application codebase using Qwen3 Embedding 8B enables retrieval of relevant methods/classes. It tracks both declaring and containing classes for method context, facilitating accurate resolution of inherited, overloaded, and context-sensitive behavioral elements.

NL Decomposer

Leveraging LLM prompting, Sakura transforms free-form NL descriptions into BDD-inspired structured blocks, distinguishing setup, scenario (with precondition/action/assertion), and teardown phases. Each block records task description, explicit data flow, and flags for external dependencies. Figure 2

Figure 2: BDD-style decomposition of a circular inheritance test scenario, showing separation of setup, scenario, and teardown steps.

Multi-Agent Workflow

Localization Agent

Grounds each BDD step in candidate classes/methods via semantic search, static analysis, and code inspection. It annotates uncertainty and ensures external dependencies are handled robustly.

Composition Agent

Synthesizes and validates Java test code iteratively, using build-system feedback (compilation/exception/assertion reports) to refine code and correct errors (e.g., type mismatches, missing imports, assertion faults).

Supervisor Agent

Coordinates agentic execution, orchestrating subagent calls, managing decomposition state, and terminating upon successful test compilation/execution. It diagnoses failures and triggers targeted re-localization or composition correction.

Evaluation Dataset

Sakura’s evaluation dataset comprises 1,464 NL test descriptions at three abstraction levels paired with 488 developer-written tests from 20 Apache Commons projects. The descriptions are generated post-training cutoff dates of evaluated LLMs to eliminate data leakage and synthesized via Sonnet 4.5 for abstraction stratification. Figure 3

Figure 3: Three-phase pipeline for dataset creation—repository filtering, static context extraction, and abstraction-level NL generation.

The dataset is notable for its structural complexity (median 2 focal methods per test), supporting evaluation along specificity and behavioral diversity axes.

Empirical Results

Sakura is evaluated against Gemini CLI (baseline) using both proprietary (Gemini 2.5 Pro/Flash) and open-source (Qwen3-Coder, Devstral Small 2) models. Metrics include compilability, coverage overlap, structural fidelity, abstraction/complexity sensitivity, tool invocation patterns, and cost.

Compilability

Sakura attains 50–78% higher compilability than Gemini CLI, achieving 97.3% for Gemini 2.5 Pro and 85.4% with Devstral Small 2. Open-source models under Sakura outperform proprietary Gemini CLI settings.

Coverage Overlap

Sakura yields 38–66% higher coverage overlap than baseline, with Qwen3-Coder achieving 73.2% branch coverage versus 46.3% for Gemini CLI with Gemini 2.5 Pro.

Structural Fidelity

Recall for instantiated types, assertion types, invoked methods, and focal methods is 31–72% higher for Sakura. Assertion recall is robust across abstraction levels, whereas focal method recall degrades under high abstraction but remains superior (45.5% vs. 18.2% for Gemini CLI). Figure 4

Figure 4: Structural fidelity metric comparison for recall across test properties.

Abstraction Sensitivity

While lower abstraction descriptions yield higher quality, Sakura maintains substantial performance advantages at all levels. With high abstraction, focal method recall remains above twice that of Gemini CLI. Figure 5

Figure 5: Robustness of coverage and structural fidelity across abstraction levels.

Complexity Sensitivity

Performance stability under focal method count demonstrates Sakura’s resilience: 5+ focal tests achieve 64–85% recall/coverage with Gemini 2.5 Pro, compared to 29–37% for Gemini CLI. Figure 6

Figure 6: Complexity sensitivity analysis as focal method count increases.

Tool Usage and Cost

Sakura’s tool utilization is balanced across categories (retrieval, inspection, generation, validation, orchestration) irrespective of model, in contrast to Gemini CLI’s model-dependent skew.

Cost analysis reveals Sakura with Qwen3-Coder costs half as much per test as Gemini CLI with Gemini 2.5 Pro, with empirical performance advantages, validating the multi-agent decomposition’s efficiency/capacity tradeoff.

Implications and Future Directions

Sakura demonstrates the efficacy of domain-specific, multi-agent decomposition for NL-to-code test generation, enabling smaller LLMs to outperform larger commodity agents when aligned with curated toolsets and explicit workflow scaffolding. This has practical implications for scalable, cost-effective automated testing deployment, particularly in CI/CD ecosystems.

The architecture is theoretically extensible to other statically-typed languages and NL-driven API or bug-report-based test generation. Future work can explore uncertainty quantification, fine-tuning/merging of smaller models, adaptive decomposition strategies, and broader domain adaptation.

Conclusion

Sakura provides a scalable, empirically validated approach for generating structurally complex Java tests from NL descriptions, outperforming baseline LLM agentic tools in compilability, behavioral fidelity, and cost efficiency. Structured multi-agent pipelines and code-grounded decomposition are essential for advancing NL-to-test generation beyond unit-method scope, with implications for future research in test automation, REPL-based validation, and agentic software engineering benchmarks (2606.00530).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 2 likes about this paper.