Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Published 30 May 2026 in cs.SE | (2606.00530v1)

Abstract: Testing is a core activity in software development workflows, and research on its automation has spanned several decades. Most existing approaches generate unit tests for individual methods, validate isolated API endpoints, or target user interface (UI) layers, with non-API and non-UI automated test generators typically exercising only a single focal method. Recent empirical evidence shows a substantial gap between such generated tests and developer-written ones, which often span multiple focal methods, involve complex call sequences, and contain elaborate assertions that current automated approaches fail to capture. To address this gap, we propose generating tests from natural language (NL) descriptions of developer intent. We present Sakura, the first agent-based framework for generating structurally complex test cases from NL descriptions. Sakura decomposes NL descriptions into structured blocks and processes them using a multi-agent system consisting of a localization agent that grounds test steps in concrete application code via static analysis, a composition agent that synthesizes compilable test code and iteratively refines it using execution feedback, and a supervisor agent that coordinates agent interactions. To evaluate Sakura, we curate a novel dataset of NL test descriptions at three levels of abstraction, systematically generated from developer-written tests mined from Apache Commons projects. Across 20 applications and 1,464 test scenarios, Sakura significantly outperforms off-the-shelf agentic tools such as Gemini CLI. Specifically, Sakura achieves 50-78% higher test compilability and 38-66% higher coverage overlap with ground-truth tests compared to baselines using the same models. Moreover, Sakura paired with small open-source models such as Devstral Small 2 and Qwen3-Coder outperforms Gemini CLI using large proprietary models, while also being more cost-effective.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper presents a multi-agent NL-to-test generation framework that decomposes natural language descriptions into BDD-style blocks for structurally complex testing.
It uses specialized agents for grounding, code synthesis, and iterative validation, achieving 50–78% higher compilability and 38–66% improved coverage over baselines.
The approach is scalable, cost-effective, and extensible to other statically-typed languages, marking a significant advancement in automated test generation.

Sakura: Multi-Agent NL-to-Test Generation for Structurally Complex Scenarios

Motivation and Problem Statement

Automated test generation has historically focused on unit tests targeting individual methods or classes, and API/UI-centric test suites, leaving a substantial gap relative to the structural complexity of developer-written tests in real-world codebases. Empirical evidence reveals that manually constructed tests frequently involve multi-class/multi-method interactions, complex setup/teardown procedures, and dense assertion logic, which most ATG tools—including conventional LLM-driven agents—fail to capture (Pan et al., 30 Sep 2025). The mismatch is evident in the stark divergence between generated and developer-written tests with respect to focal method coverage, behavioral fidelity, and assertion naturalness.

Initiating test generation from natural language (NL) descriptions of developer intent, rather than code-centric heuristics, offers a path toward bridging this gap. However, standard LLM-based agentic tools (e.g., Gemini CLI) lack structural grounding mechanisms, often producing semantically plausible but syntactically invalid or behaviorally incorrect test cases. Addressing this, Sakura introduces a multi-agent system with coordinated decomposition, grounding, code synthesis, and iterative validation, thereby enabling the generation of structurally complex tests directly from NL input.

Framework Architecture

Sakura operationalizes NL-driven test generation through a pipeline consisting of: code index creation, NL decomposition, and an agentic workflow combining localization, composition, and supervisory coordination.

Figure 1: The Sakura framework converts NL test descriptions into BDD-style blocks and synthesizes executable, structurally complex test code via specialized agents.

Code Indexing

A onetime vectorization of the application codebase using Qwen3 Embedding 8B enables retrieval of relevant methods/classes. It tracks both declaring and containing classes for method context, facilitating accurate resolution of inherited, overloaded, and context-sensitive behavioral elements.

NL Decomposer

Leveraging LLM prompting, Sakura transforms free-form NL descriptions into BDD-inspired structured blocks, distinguishing setup, scenario (with precondition/action/assertion), and teardown phases. Each block records task description, explicit data flow, and flags for external dependencies.

Figure 2: BDD-style decomposition of a circular inheritance test scenario, showing separation of setup, scenario, and teardown steps.

Multi-Agent Workflow

Localization Agent

Grounds each BDD step in candidate classes/methods via semantic search, static analysis, and code inspection. It annotates uncertainty and ensures external dependencies are handled robustly.

Composition Agent

Synthesizes and validates Java test code iteratively, using build-system feedback (compilation/exception/assertion reports) to refine code and correct errors (e.g., type mismatches, missing imports, assertion faults).

Supervisor Agent

Coordinates agentic execution, orchestrating subagent calls, managing decomposition state, and terminating upon successful test compilation/execution. It diagnoses failures and triggers targeted re-localization or composition correction.

Evaluation Dataset

Sakura’s evaluation dataset comprises 1,464 NL test descriptions at three abstraction levels paired with 488 developer-written tests from 20 Apache Commons projects. The descriptions are generated post-training cutoff dates of evaluated LLMs to eliminate data leakage and synthesized via Sonnet 4.5 for abstraction stratification.

Figure 3: Three-phase pipeline for dataset creation—repository filtering, static context extraction, and abstraction-level NL generation.

The dataset is notable for its structural complexity (median 2 focal methods per test), supporting evaluation along specificity and behavioral diversity axes.

Empirical Results

Sakura is evaluated against Gemini CLI (baseline) using both proprietary (Gemini 2.5 Pro/Flash) and open-source (Qwen3-Coder, Devstral Small 2) models. Metrics include compilability, coverage overlap, structural fidelity, abstraction/complexity sensitivity, tool invocation patterns, and cost.

Compilability

Sakura attains 50–78% higher compilability than Gemini CLI, achieving 97.3% for Gemini 2.5 Pro and 85.4% with Devstral Small 2. Open-source models under Sakura outperform proprietary Gemini CLI settings.

Coverage Overlap

Sakura yields 38–66% higher coverage overlap than baseline, with Qwen3-Coder achieving 73.2% branch coverage versus 46.3% for Gemini CLI with Gemini 2.5 Pro.

Structural Fidelity

Recall for instantiated types, assertion types, invoked methods, and focal methods is 31–72% higher for Sakura. Assertion recall is robust across abstraction levels, whereas focal method recall degrades under high abstraction but remains superior (45.5% vs. 18.2% for Gemini CLI).

Figure 4: Structural fidelity metric comparison for recall across test properties.

Abstraction Sensitivity

While lower abstraction descriptions yield higher quality, Sakura maintains substantial performance advantages at all levels. With high abstraction, focal method recall remains above twice that of Gemini CLI.

Figure 5: Robustness of coverage and structural fidelity across abstraction levels.

Complexity Sensitivity

Performance stability under focal method count demonstrates Sakura’s resilience: 5+ focal tests achieve 64–85% recall/coverage with Gemini 2.5 Pro, compared to 29–37% for Gemini CLI.

Figure 6: Complexity sensitivity analysis as focal method count increases.

Tool Usage and Cost

Sakura’s tool utilization is balanced across categories (retrieval, inspection, generation, validation, orchestration) irrespective of model, in contrast to Gemini CLI’s model-dependent skew.

Cost analysis reveals Sakura with Qwen3-Coder costs half as much per test as Gemini CLI with Gemini 2.5 Pro, with empirical performance advantages, validating the multi-agent decomposition’s efficiency/capacity tradeoff.

Implications and Future Directions

Sakura demonstrates the efficacy of domain-specific, multi-agent decomposition for NL-to-code test generation, enabling smaller LLMs to outperform larger commodity agents when aligned with curated toolsets and explicit workflow scaffolding. This has practical implications for scalable, cost-effective automated testing deployment, particularly in CI/CD ecosystems.

The architecture is theoretically extensible to other statically-typed languages and NL-driven API or bug-report-based test generation. Future work can explore uncertainty quantification, fine-tuning/merging of smaller models, adaptive decomposition strategies, and broader domain adaptation.

Conclusion

Sakura provides a scalable, empirically validated approach for generating structurally complex Java tests from NL descriptions, outperforming baseline LLM agentic tools in compilability, behavioral fidelity, and cost efficiency. Structured multi-agent pipelines and code-grounded decomposition are essential for advancing NL-to-test generation beyond unit-method scope, with implications for future research in test automation, REPL-based validation, and agentic software engineering benchmarks (2606.00530).

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Summary

Sakura: Multi-Agent NL-to-Test Generation for Structurally Complex Scenarios

Motivation and Problem Statement

Framework Architecture

Code Indexing

NL Decomposer

Multi-Agent Workflow

Localization Agent

Composition Agent

Supervisor Agent

Evaluation Dataset

Empirical Results

Compilability

Coverage Overlap

Structural Fidelity

Abstraction Sensitivity

Complexity Sensitivity

Tool Usage and Cost

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Collections

Tweets

Don't miss out on important new AI/ML research

Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Summary

Sakura: Multi-Agent NL-to-Test Generation for Structurally Complex Scenarios

Motivation and Problem Statement

Framework Architecture

Code Indexing

NL Decomposer

Multi-Agent Workflow

Localization Agent

Composition Agent

Supervisor Agent

Evaluation Dataset

Empirical Results

Compilability

Coverage Overlap

Structural Fidelity

Abstraction Sensitivity

Complexity Sensitivity

Tool Usage and Cost

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Collections

Tweets

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research