Bug Reproducible Tests (BRTs)

Updated 2 July 2026

Bug Reproducible Tests (BRTs) are tests designed with a fail-to-pass semantic that confirms a specific bug in the buggy version and its resolution in the fixed version.
They are constructed through techniques such as version pair mining, LLM-based test generation, and black-box amplification, which together cover contexts ranging from classical unit tests to concurrent and quantum programs.
BRTs serve as ground-truth oracles in automatic program repair, fault localization, and empirical evaluation, supporting reproducible testing in CI environments and benchmark infrastructures.

A Bug Reproducible Test (BRT) is a test artifact that, by design, fails on a buggy program version and passes on its corresponding fixed version. BRTs are central to empirical software engineering: they rigorously demonstrate the existence of a specific defect, facilitate regression detection, and serve as ground-truth oracles for validation in automatic program repair (APR), fault localization, and broader software evaluation. BRTs appear in benchmark infrastructures, agentic repair pipelines, concurrency bug amplification frameworks, LLM-driven test generation, and across domains ranging from classical to quantum software.

1. Formal Definition and Core Properties

A BRT is characterized by its “fail-to-pass” semantic: let $C_b$ denote the buggy program and $C_f$ the fixed version. A test $t$ is a BRT if and only if:

$t(C_b) = \mathit{Fail} \quad \wedge \quad t(C_f) = \mathit{Pass}$

This fail–pass dichotomy is consistent across diverse contexts: unit tests in classical systems (Madeiral et al., 2019), concurrency test harnesses (Weiss et al., 28 Jul 2025), and statistical oracles in quantum programs (Campos et al., 2021). The BRT must be executable under both program revisions in an equivalent environment—whether defined by Maven/JUnit/JDK8 (Madeiral et al., 2019), CI-based containers (Tomassi et al., 2019), or framework-specific runners (e.g., Bazel at Google (Cheng et al., 3 Feb 2025)).
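As a minimal sketch of the definition above, the fail-to-pass predicate can be written in Python by modeling each program version as a callable and a test as a function that returns True on pass. The names buggy_add, fixed_add, check_add, and is_brt are hypothetical and exist only for illustration; real pipelines run a test runner against two checked-out revisions rather than calling functions directly.

```python
from typing import Callable

# Illustrative modeling assumption: a program version is a callable, and a
# test takes a version and returns True (Pass) or False (Fail).
Program = Callable[[int, int], int]
Test = Callable[[Program], bool]


def buggy_add(a: int, b: int) -> int:
    """Hypothetical buggy version C_b: subtracts instead of adding."""
    return a - b


def fixed_add(a: int, b: int) -> int:
    """Hypothetical fixed version C_f."""
    return a + b


def check_add(program: Program) -> bool:
    """Candidate test t: passes iff program(2, 3) == 5."""
    return program(2, 3) == 5


def is_brt(test: Test, buggy: Program, fixed: Program) -> bool:
    """True iff t(C_b) = Fail and t(C_f) = Pass."""
    return (not test(buggy)) and test(fixed)
```

Here is_brt(check_add, buggy_add, fixed_add) holds, whereas a test such as lambda p: p(0, 0) == 0 passes on both versions and is therefore not a BRT.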

Table: BRT Criteria Across Representative Contexts

Context	Must Fail on Buggy	Must Pass on Fixed	Execution Environment
Java/Maven/JUnit (Bears)	Yes	Yes	CI/Maven/JDK8
Agentic APR (Google)	Yes	Yes	Bazel/internal CI
Python/pytest	Yes	Yes	pytest/test runner
Quantum (QBugs)	$D_b \not\approx D$	$D_c \approx D$	Simulator/hardware
Benchmarks (BugSwarm/GitBug)	Yes	Yes	Containerized CI image

All BRT workflows require that both code versions, as well as the test suite, can be built and executed within a controlled, reproducible environment, so that the fail-to-pass behavior can be replayed reliably over time (Madeiral et al., 2019, Tomassi et al., 2019, Saavedra et al., 2023).

2. Construction Pipelines and Methodologies

BRTs can be mined, synthesized, or generated via several technical workflows:

a) Version Pair Mining from CI Events

Frameworks such as Bears and BugSwarm automate BRT discovery by scanning CI logs for build pairs $(b_\text{fail}, b_\text{pass})$ where the first build fails due to a test case and the second passes following a code patch. The canonical procedure (Madeiral et al., 2019, Tomassi et al., 2019, Saavedra et al., 2023):

Identify candidate version pairs using CI pass/fail transitions and code/test diff heuristics.
Extract test suites and check for at least one test $t$ failing on $C_b$ and all tests passing on $C_f$.
Archive artifacts, environment descriptions, and execution outputs for reproducibility.
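The first two steps above can be sketched as a scan over a linear build history. This is an illustrative simplification, not the Bears or BugSwarm implementation: the Build record, its fields, and mine_fail_pass_pairs are assumed names, and a real pipeline would subsequently rebuild both revisions and rerun the tests before accepting a pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Build:
    """Hypothetical CI build record (field names are assumptions)."""
    commit: str
    status: str                     # "passed" or "failed"
    failed_tests: Tuple[str, ...]   # empty when no test failed


def mine_fail_pass_pairs(history: List[Build]) -> List[Tuple[Build, Build]]:
    """Return consecutive (b_fail, b_pass) candidates.

    A pair qualifies only if the failing build failed because of at least
    one test (not, for example, a compilation error) and the next build passed.
    """
    pairs = []
    for prev, curr in zip(history, history[1:]):
        if prev.status == "failed" and prev.failed_tests and curr.status == "passed":
            pairs.append((prev, curr))
    return pairs


# Made-up history for illustration only.
history = [
    Build("a1", "passed", ()),
    Build("b2", "failed", ("test_parse",)),
    Build("c3", "passed", ()),
    Build("d4", "failed", ()),      # failed without a failing test: skipped
    Build("e5", "passed", ()),
]
candidate_pairs = mine_fail_pass_pairs(history)
```

In this made-up history only the (b2, c3) pair is kept, since build d4 failed without any failing test.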

b) Test Generation from Bug Reports (LLM/Hybrid Approaches)

LLM-based pipelines (LIBRO, AssertFlip, iCoRe, etc.) produce BRTs by (i) ingesting bug reports, (ii) constructing test cases likely to expose the defect, and (iii) validating generated tests against buggy and fixed releases (Kang et al., 2022, Qureshi et al., 6 Oct 2025, Wang et al., 21 Apr 2026, Khatib et al., 23 Jul 2025). Approaches include:

Direct generation, where LLMs synthesize failing test cases from a prompt.
Pass-then-invert (AssertFlip), where LLMs produce a test that passes on the buggy code, and programmatic inversion of its oracle then yields a BRT for the bug (Khatib et al., 23 Jul 2025).
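A toy illustration of the pass-then-invert idea follows. It is not AssertFlip's actual implementation: it assumes a test can be modeled as a boolean function of the program under test, and it inverts the oracle by plain negation. All names (buggy_abs, fixed_abs, passing_on_buggy, invert_oracle) are hypothetical.

```python
from typing import Callable

Program = Callable[[int], int]
Test = Callable[[Program], bool]


def buggy_abs(x: int) -> int:
    """Hypothetical buggy version: forgets to negate negative inputs."""
    return x


def fixed_abs(x: int) -> int:
    """Hypothetical fixed version."""
    return -x if x < 0 else x


def passing_on_buggy(program: Program) -> bool:
    """Test that encodes the behavior currently observed on the buggy code."""
    return program(-3) == -3


def invert_oracle(test: Test) -> Test:
    """Return a test whose verdict is the negation of the original verdict."""
    def inverted(program: Program) -> bool:
        return not test(program)
    return inverted


candidate_brt = invert_oracle(passing_on_buggy)
```

The inverted test fails on buggy_abs and passes on fixed_abs, so it satisfies the fail-to-pass condition. Note that negation alone yields a weak oracle: the inverted test accepts any output other than -3, so a practical pipeline would still need to check that the test captures the behavior described in the bug report.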

c) Black-Box Amplification (for Nondeterministic/Heisenbugs)

In concurrent systems, BRTs often require a systematic search for test configurations that amplify the probability of bug manifestation. Methods employ black-box sampling, ensemble regression surrogates, and iterative search to identify parameter vectors that maximize failure rates, then validate these as robust BRTs (Weiss et al., 28 Jul 2025).
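The search for high-failure-rate configurations can be sketched as follows. This is a deliberately simplified stand-in: it scores a small, fixed set of candidate configurations by repeated sampling rather than using the ensemble regression surrogates mentioned above, and the black-box run_once function, with its made-up failure model over threads and delay_ms, is an assumption introduced purely for illustration.

```python
import random
from typing import List, Tuple

Config = Tuple[int, int]  # (threads, delay_ms), hypothetical test parameters


def run_once(config: Config, rng: random.Random) -> bool:
    """Simulated black-box test run; returns True if the bug manifested.

    The failure model is invented: more threads and shorter delays make
    the simulated race more likely to trigger.
    """
    threads, delay_ms = config
    p_fail = min(0.9, 0.05 * threads) / (1 + delay_ms)
    return rng.random() < p_fail


def failure_rate(config: Config, trials: int, rng: random.Random) -> float:
    """Estimate the failure probability of one configuration by sampling."""
    failures = sum(run_once(config, rng) for _ in range(trials))
    return failures / trials


def amplify(candidates: List[Config], trials: int = 200, seed: int = 0) -> Tuple[float, Config]:
    """Return (estimated failure rate, configuration) with the highest rate."""
    rng = random.Random(seed)  # seeded so the search itself is repeatable
    scored = [(failure_rate(c, trials, rng), c) for c in candidates]
    return max(scored)


best_rate, best_config = amplify([(1, 10), (4, 5), (16, 0)])
```

Under this made-up failure model the search selects the configuration with 16 threads and no delay, whose true failure probability is 0.8. A surrogate-based method would replace the exhaustive scoring loop with a learned model that proposes which configurations to sample next.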

d) Joint Cogeneration in Agentic/LLM-Driven APR

Modern APR agents cogenerate both a plausible fix and its BRT in a single editing session, using multi-step prompting and action loops. Cogenerated BRTs are validated to fail on $C_b$ and pass on the generated or oracle-fixed code, and are then packaged alongside the fix patch (Cheng et al., 27 Jan 2026).

3. Benchmarks, Automation, and Large-Scale Reproducibility

The persistence, transparency, and diversity of BRTs have been enabled by systematic benchmarks:

a) CI-Mined Benchmarks

Bears: 251 BRTs from 72 projects, discovered via Travis-CI build analysis and manual validation, and stored as Git branches with full reproduction metadata. The median patch touches eight lines, and the success rate across all candidate pairs is 6.9% (Madeiral et al., 2019).
BugSwarm: 3,091 CI-mined fail–pass job pairs in Java and Python, encapsulated as Docker images. Reproducibility is enforced by repeated test executions in the archived containers; ~5.56% reproduction success rate (Tomassi et al., 2019).
GitBug-Actions: Extends the paradigm to Go and other languages using GitHub Actions as the discovery and execution layer, applying deterministic, locally executable container snapshots for indefinite testability (Saavedra et al., 2023).

b) Quantum Benchmarks

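QBugs: Adapts the CI benchmark paradigm to quantum algorithms, parameterizing BRTs by output distributions (the buggy program's distribution $D_b$ must deviate from the expected distribution $D$, while the corrected program's distribution $D_c$ must match it) and enforcing reproducibility via simulator or hardware-executed statistical tests (Campos et al., 2021).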
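A statistical oracle of this kind can be sketched by comparing an empirical distribution of measurement outcomes against an expected one. The choice of total variation distance, the tolerance of 0.1, and the sample counts below are all assumptions made for illustration; the source does not state which statistical test QBugs uses, and the samples are made up rather than produced by a simulator.

```python
from collections import Counter
from typing import Dict, List


def empirical_distribution(samples: List[str]) -> Dict[str, float]:
    """Relative frequency of each measured outcome (e.g. bitstrings)."""
    counts = Counter(samples)
    total = len(samples)
    return {outcome: n / total for outcome, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def statistical_test_passes(samples: List[str], expected: Dict[str, float],
                            tolerance: float = 0.1) -> bool:
    """Pass iff the observed distribution is within `tolerance` of `expected`."""
    return total_variation(empirical_distribution(samples), expected) <= tolerance


# Expected distribution D for an ideal two-qubit Bell state measurement.
expected = {"00": 0.5, "11": 0.5}
# Made-up measurement samples standing in for the buggy and corrected programs.
buggy_samples = ["00"] * 100
corrected_samples = ["00"] * 52 + ["11"] * 48
```

With these made-up samples the test fails for the buggy program (distance 0.5) and passes for the corrected one (distance 0.02), mirroring the fail-to-pass condition in distributional form. The tolerance trades off false alarms from sampling noise against missed deviations.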

c) Agentic Industrial Benchmarks

At Google, proprietary corpora (Java, C++, Go, Python, etc.) anchor BRT evaluation in monolithic, Bazel-based repositories, challenging LLM-based approaches to conform to internal build, test, and API conventions (Cheng et al., 3 Feb 2025).

4. Automation, LLM-Based Generation, and Retrieval

Recent work has explored automated generation of BRTs directly from bug reports and code context using LLM-driven, agentic, or hybrid pipelines.

Prompt-based Generation: LIBRO feeds bug reports and few-shot test examples to a large code model (e.g. Codex), validates outputs, clusters by failure messages, and ranks BRT candidates (Kang et al., 2022). Simple post-processing heuristics and clustering are used to filter and rank the LLM outputs, improving BRT precision.
Iterative, Correlation-Aware Retrieval: iCoRe coordinates differentiated retrieval of source and test context, incorporates function-call graph similarity, and iteratively refines retrieval and generation using LLMs, achieving 42.0–52.8% fail-to-pass rates on real-world Python benchmarks and outperforming prior retrievers (Wang et al., 21 Apr 2026).
Pass-then-Invert Strategies: AssertFlip generates a passing test on buggy code, then algorithmically inverts its oracle to obtain a BRT. This outperforms direct generation in F→P success on the SWT-Bench-Verified corpus (43.6%) (Khatib et al., 23 Jul 2025).
Agentic Cogeneration: At scale, LLM-driven APR agents cogenerate fixes and accompanying BRTs; patch selection strategies incorporate BRT presence to maximize both correct repair and reproduction (Cheng et al., 27 Jan 2026).
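The clustering-and-ranking step described for prompt-based generation can be sketched as follows. The ranking criterion here, larger clusters of identical failure messages first, is an assumption for illustration and not a statement of LIBRO's exact heuristics; rank_candidates and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_candidates(results: List[Tuple[str, str]]) -> List[str]:
    """Rank generated tests that fail on the buggy version.

    `results` holds (test_name, failure_message) pairs. Tests are grouped by
    failure message; clusters are ordered by descending size (ties broken by
    message text), and one representative per cluster is returned.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for name, message in results:
        clusters[message].append(name)
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [names[0] for _, names in ordered]


# Made-up generation results for illustration only.
results = [
    ("t1", "AssertionError: expected 5"),
    ("t2", "NullPointerException"),
    ("t3", "AssertionError: expected 5"),
]
ranking = rank_candidates(results)  # → ["t1", "t2"]
```

The intuition behind this assumed criterion is that independently generated tests agreeing on the same failure are more likely to reflect the reported bug than a one-off failure.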

A recurring challenge is context quality: LLM-based pipelines are highly sensitive to identifier masking, code mutations, or incomplete retrieval, with performance dropping by up to 71% under identifier mutations (Qureshi et al., 6 Oct 2025). Dynamic, open-book retrieval and real-time feedback loops mitigate these issues but increase computational cost (Qureshi et al., 6 Oct 2025, Wang et al., 21 Apr 2026).

5. Metrics, Empirical Characteristics, and Analysis

The empirical quality of BRTs is assessed via their fail-to-pass rate, coverage properties, fault detection, and reproducibility.

a) Core Metrics

Fail-to-pass reproduction rate: Proportion of bugs for which at least one valid BRT is generated or captured (e.g., LIBRO 33.5% (Kang et al., 2022), AssertFlip 43.6% (Khatib et al., 23 Jul 2025), iCoRe 42.0–52.8% (Wang et al., 21 Apr 2026)).
Coverage statistics: Patch line coverage, assertion count, and use of weak/strong oracles (Hora et al., 3 Feb 2026, Khatib et al., 23 Jul 2025).
Validation process: End-to-end validation involves confirming that the BRT fails on the buggy version and passes on the fixed; most toolchains require automated or manual confirmation (Madeiral et al., 2019, Cheng et al., 3 Feb 2025).
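The fail-to-pass reproduction rate defined above can be computed as in the following sketch. The input layout, a mapping from bug identifier to per-candidate (failed on buggy, passed on fixed) verdicts, is an assumed representation, and the data is made up.

```python
from typing import Dict, List, Tuple

# For each bug: one (failed_on_buggy, passed_on_fixed) pair per candidate test.
Outcomes = Dict[str, List[Tuple[bool, bool]]]


def fail_to_pass_rate(outcomes: Outcomes) -> float:
    """Fraction of bugs with at least one candidate that fails on the buggy
    version and passes on the fixed version."""
    if not outcomes:
        return 0.0
    reproduced = sum(
        1 for candidates in outcomes.values()
        if any(failed and passed for failed, passed in candidates)
    )
    return reproduced / len(outcomes)


# Made-up validation verdicts for four bugs.
outcomes = {
    "bug-1": [(True, True)],                    # valid BRT
    "bug-2": [(True, False), (False, True)],    # no single test does both
    "bug-3": [],                                # nothing generated
    "bug-4": [(False, True), (True, True)],     # second candidate is valid
}
rate = fail_to_pass_rate(outcomes)  # → 0.5
```

Note that bug-2 does not count as reproduced: one candidate fails on the buggy version and another passes on the fixed version, but the definition requires a single test to do both.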

b) Empirical Properties

In a large-scale study on Python projects, BRTs exhibit no significant difference in LOC, assertion count, or control-flow complexity from non-BRTs, though they marginally favor try/except patterns and weak assertions (6% vs 2% usage) (Hora et al., 3 Feb 2026).
The mapping from bugs to tests is almost one-to-one (95% of BRTs target a single bug) (Hora et al., 3 Feb 2026).
Project diversity among CI-mined BRTs spans codebase sizes from ~0.8 KLOC to ~205 KLOC and test suite sizes from 16 to 8,066 (Madeiral et al., 2019).

c) Practical Reproducibility

Tooling ensures durable test reproduction via containerization or local CI execution (e.g. frozen Docker images in BugSwarm or GitBug-Actions (Tomassi et al., 2019, Saavedra et al., 2023)).
Repeated execution, flakiness filtering, and environment snapshotting are critical to confirm stable behavior (Tomassi et al., 2019, Saavedra et al., 2023).
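Flakiness filtering by repeated execution can be sketched as below. The runner callables, the default of five repetitions, and the helper names are assumptions for illustration; in benchmark tooling each run would execute the test inside the archived container rather than call a Python function.

```python
from typing import Callable, Iterable


def is_stable_brt(run_on_buggy: Callable[[], bool],
                  run_on_fixed: Callable[[], bool],
                  repetitions: int = 5) -> bool:
    """Accept a test only if it fails on the buggy version and passes on the
    fixed version in every one of `repetitions` runs.

    Each runner returns True when the test passes.
    """
    buggy_results = [run_on_buggy() for _ in range(repetitions)]
    fixed_results = [run_on_fixed() for _ in range(repetitions)]
    return not any(buggy_results) and all(fixed_results)


def replay(verdicts: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical runner that replays a fixed sequence of verdicts."""
    iterator = iter(verdicts)
    return lambda: next(iterator)


# A deterministic BRT: always fails on buggy, always passes on fixed.
stable = is_stable_brt(replay([False] * 5), replay([True] * 5))
# A flaky test: passes on the fixed version in only four of five runs.
flaky = is_stable_brt(replay([False] * 5), replay([True, False, True, True, True]))
```

Here stable is True and flaky is False. Requiring unanimous verdicts is a strict policy; a more lenient filter could tolerate a bounded number of deviations, at the cost of admitting nondeterministic tests into the benchmark.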

6. Limitations, Open Problems, and Future Directions

Despite methodological advances, BRT generation and benchmarking face several challenges:

Environment and Framework Limitations: Many CI-based datasets are limited to Java, Python, or Go projects with standardized CI and build systems (Maven, Bazel, pytest) (Madeiral et al., 2019, Tomassi et al., 2019, Saavedra et al., 2023). Porting to other ecosystems requires extensive engineering.
Rarity and Quality of Human-Written BRTs: In practice, only a minority of bug reports include a pre-written BRT—4% in Defects4J and even fewer in industrial settings—motivating automatic generation research (Cheng et al., 3 Feb 2025).
Context Quality and Retrieval Robustness: LLM-driven pipelines are highly sensitive to retrieval failures; function-call–aware, iterative retrievers such as iCoRe are actively studied to mitigate this (Wang et al., 21 Apr 2026, Khatib et al., 23 Jul 2025).
Bugs with Elusive Manifestations: Rare-event bugs (e.g., concurrency Heisenbugs) may require black-box amplification and statistical search to obtain meaningful BRTs (Weiss et al., 28 Jul 2025).
Framework and Artifact Maintenance: Long-term reproducibility is threatened by CI evolution, dependency staleness, and repository refactorings; container snapshotting and reproducibility checks mitigate these threats (Tomassi et al., 2019, Saavedra et al., 2023, Campos et al., 2021).
Empirical Gaps: It remains an open problem whether BRTs consistently achieve higher code or mutant coverage than standard tests. Empirical studies to date suggest little structural difference, but propose targeted reductions and stronger oracles to maximize diagnostic power (Hora et al., 3 Feb 2026).

Future directions focus on extending language and CI support (Saavedra et al., 2023), generalizing LLM-based BRT synthesis across languages and styles (Qureshi et al., 6 Oct 2025, Wang et al., 21 Apr 2026), developing hybrid and agentic generation workflows (Cheng et al., 3 Feb 2025, Cheng et al., 27 Jan 2026), and incorporating dynamic, interactive clarification in LLM reasoning (Qureshi et al., 6 Oct 2025).

7. Applications and Significance in Research and Practice

BRTs underpin a broad range of empirical research and development activities:

Benchmarking and Comparative Evaluation: BRT-rich benchmarks such as Bears, BugSwarm, GitBug-Actions, and QBugs provide ground-truth for APR, fault localization, and test minimization studies, directly influencing tool development and assessment (Madeiral et al., 2019, Tomassi et al., 2019, Saavedra et al., 2023, Campos et al., 2021).
Program Repair Workflows: BRTs act as precise oracles to confirm fix correctness, and their automated cogeneration within APR pipelines increases both fix plausibility and reviewer confidence (Cheng et al., 3 Feb 2025, Cheng et al., 27 Jan 2026).
Test Quality and Reduction: Empirical insights suggest practitioners should prefer single-bug, reduced-scenario, strong-assertion BRTs for optimal localization and diagnostic value (Hora et al., 3 Feb 2026).
Concurrency and Nondeterminism Research: Statistical BRT construction enables controlled surfacing of rare-event and Heisenbugs for research on robustness and systemic reliability (Weiss et al., 28 Jul 2025).
Cross-domain Extension: The BRT concept is now being extended to quantum, agentic, and multi-modal systems, with open challenges in measuring, reproducing, and benchmarking such complex test scenarios (Campos et al., 2021).

BRTs now serve as a unifying artifact for large-scale, empirical, and automated evaluation of bugs, fixes, and software robustness, anchoring both foundational and applied research in software engineering.