
Testora: NLP-Driven Regression Testing System

Updated 6 February 2026
  • Testora is an automated regression-testing system that uses natural language from pull requests to generate and classify tests based on developers’ stated intent.
  • It employs a multi-stage pipeline including PR filtering, targeted test generation, differential test execution, and LLM-based classification to enhance precision.
  • Its empirical evaluation demonstrates reduced false positives, improved regression detection, and cost-effective integration into continuous integration workflows.

Testora is an automated regression-testing system that leverages natural-language information from pull requests to accurately detect unintended behavioral changes in evolving software projects. Unlike traditional regression testing, Testora integrates LLMs to generate and classify tests using the explicit natural-language intent associated with each code change, reframing regression detection around what developers state should change rather than flagging all behavioral differences as suspect.

1. Motivation and Conceptual Underpinnings

The conventional approach to regression testing treats any behavioral difference between software revisions as a potential regression, resulting in large numbers of false positives, since most changes are deliberate (e.g., bug fixes, feature additions). Testora’s key insight is to exploit the natural-language artifacts produced during software development—PR titles, descriptions, commit messages, and discussions—as an implicit test oracle. By comparing the intended changes outlined in these artifacts with the observed behavioral differences (as surfaced by generated tests), Testora selectively highlights only those behavioral changes that do not align with the stated intent, substantially reducing developer alert fatigue and improving the precision of regression detection (Pradel, 24 Mar 2025).

2. System Inputs and Pipeline Architecture

Testora formalizes each pull request as a tuple $pr = (t, d, \Delta, m_c, m_d)$, where $t$ is the title, $d$ the description, $\Delta$ the code diff, $m_c$ the commit messages, and $m_d$ the discussion. The output is either

  • (“unintended”, $c$, $e$): a minimal, automatically generated test $c$ that exposes a regression, accompanied by a natural-language explanation $e$, or
  • (“intended”): indicating all observed differences match stated intentions.
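The input/output contract above can be sketched as a small data model. This is a minimal illustration of the tuple $(t, d, \Delta, m_c, m_d)$ and the two possible verdicts; the class and field names are hypothetical, not Testora's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PullRequest:
    """The tuple pr = (t, d, Delta, m_c, m_d) described above."""
    title: str                   # t
    description: str             # d
    diff: str                    # Delta: the code diff
    commit_messages: list[str]   # m_c
    discussion: list[str]        # m_d

@dataclass
class Verdict:
    """Either ("unintended", c, e) or ("intended",)."""
    label: str                         # "unintended" or "intended"
    test: Optional[str] = None         # c: minimal regression-exposing test
    explanation: Optional[str] = None  # e: natural-language explanation

# Example: an unintended behavioral change surfaced by a generated test
v = Verdict("unintended",
            test="def test_nan_handling(): ...",
            explanation="NaN elements are now misclassified")
```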

The pipeline comprises four main stages:

  • PR Filtering: Exclude PRs that modify only documentation or non-source files, or those that touch more than three source files. Disregard changes affecting only comments or labeled as “DOC.”
  • Targeted Test Generation:
  1. Static analysis extracts modified function names and relevant diffs.
  2. An LLM is prompted (GPT-4o-mini by default) to generate 10 “normal-usage” and 10 “corner-case” tests per PR, requiring use of only public APIs and avoidance of non-deterministic constructs.
  3. Tests invoking private (underscore-prefixed) functions or containing undefined references are pruned; the LLM fills in missing imports/definitions for the remainder.
  • Differential Testing: The software is built at both the pre-PR (env_old) and post-PR (env_new) commits. Each generated test $c$ is executed in both environments, yielding outputs $o_{\text{old}}$ and $o_{\text{new}}$. Tests with $o_{\text{old}} \neq o_{\text{new}}$ compose $\Delta_{\text{tests}}$. Exception consistency, flakiness, and test minimization via iterative trimming are applied. The test is also rerun on the latest codebase (env_latest) to filter out issues resolved elsewhere.
  • LLM-Based Classification: Each $c \in \Delta_{\text{tests}}$ is submitted to an LLM classifier together with project and PR context, the test, its outputs, and docstrings for all invoked functions. A multi-question prompt queries the model on: (1) Noteworthiness, (2) Determinism, (3) API usage (public/private), (4) Legality of inputs, (5) Alignment with PR intent. Only if all criteria are met and a difference is deemed unintended is a regression reported.
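The differential-testing stage can be sketched as follows. `run_in_env` is a hypothetical helper standing in for building the project at a given commit and executing a generated test there; the env_latest rerun assumes "resolved elsewhere" means the latest codebase no longer reproduces the new behavior:

```python
def differential_test(tests, run_in_env):
    """Keep only tests whose output differs between env_old and env_new.

    run_in_env(test, env) is a stand-in for executing the test in a
    checkout of the pre-PR, post-PR, or latest commit.
    """
    delta_tests = []
    for test in tests:
        o_old = run_in_env(test, "env_old")
        o_new = run_in_env(test, "env_new")
        if o_old != o_new:
            # Rerun on the latest codebase: if the difference is gone
            # there, the issue was resolved elsewhere and is dropped.
            o_latest = run_in_env(test, "env_latest")
            if o_latest == o_new:
                delta_tests.append((test, o_old, o_new))
    return delta_tests

# Toy environments: only test "t2" changes behavior across the PR
outputs = {
    ("t1", "env_old"): 1, ("t1", "env_new"): 1, ("t1", "env_latest"): 1,
    ("t2", "env_old"): 2, ("t2", "env_new"): 3, ("t2", "env_latest"): 3,
}
delta = differential_test(["t1", "t2"], lambda t, e: outputs[(t, e)])
```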

3. Natural-Language Oracle and Classification Method

Testora’s classification scheme eschews probabilistic scoring in favor of a multi-question, structured prompt that systematizes the LLM’s reasoning. The five-point prompt assesses each test difference for significance, determinism, correct API usage, input validity, and congruence with PR-stated intent. Only when a difference is significant ($a_1$ = noteworthy), deterministic ($a_2$), the test uses only public APIs ($a_3$), passes docstring-based input constraints ($a_4$), and is not supported by the PR’s intention ($a_5$ = unintended) is it reported as a regression.

This multi-faceted prompting architecture is explicitly designed to mitigate both false positives (e.g., flakiness, inadequate test minimization, exceptions) and false negatives (misclassification due to imperfect PR documentation or LLM misinterpretation). PR metadata is transformed and consolidated for the classifier, ensuring maximum exposure to developer-stated intent and project-local context. The structured prompt output (in JSON) enables deterministic application of the decision rule without probability thresholds.
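The deterministic decision rule over the five answers $a_1 \ldots a_5$ can be written down directly. The JSON field names below are illustrative, not the exact schema used by the classifier prompt:

```python
import json

def is_regression(classifier_json: str) -> bool:
    """Apply the five-criterion rule: report a regression only if the
    difference is noteworthy, deterministic, uses only public APIs, has
    legal inputs, and is NOT covered by the PR's stated intent."""
    a = json.loads(classifier_json)
    return (a["noteworthy"]           # a1: significant difference
            and a["deterministic"]    # a2: not flaky
            and a["public_api"]       # a3: test calls only public APIs
            and a["legal_inputs"]     # a4: inputs satisfy docstring constraints
            and not a["intended"])    # a5: not aligned with PR intent

# A difference that meets all five criteria is reported as a regression
answers = ('{"noteworthy": true, "deterministic": true, '
           '"public_api": true, "legal_inputs": true, "intended": false}')
```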

4. Experimental Evaluation and Empirical Results

Testora was evaluated on four large Python repositories: keras, marshmallow, pandas, and scipy. Across 1,274 PRs (filtered to 525 for analysis), Testora’s workflow surfaced 30 PRs with behavioral differences not matching PR descriptions: 19 true regressions and 11 coincidental fixes (instances where undocumented bug fixes occurred as a collateral effect of the PR).

Per-PR statistics detailed in the experimental section include:

| Repository  | PRs Examined | Mean Tests Generated | Mean Unique Δ_tests | Unintended Changes Detected |
|-------------|--------------|----------------------|---------------------|------------------------------|
| keras       | 500          | ~30                  | ~20                 | 7 regressions, 2 fixes       |
| marshmallow | 274          |                      |                     |                              |
| pandas      | 500          |                      |                     |                              |
| scipy       | 500          |                      |                     | 7 regressions, 2 fixes       |

Classifier accuracy on a labeled dataset of 164 entries (139 intended, 25 unintended) shows that the multi-question prompt substantially improves precision and $F_1$ over single-question baselines: for GPT-4o-mini, precision is 58%, recall 60%, and $F_1$ = 59%. The multi-question architecture captures a more nuanced mapping between PR intent and test-detected differences.
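The reported $F_1$ follows directly from the reported precision and recall; a quick consistency check using the section's values:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for GPT-4o-mini in the evaluation
score = f1(0.58, 0.60)  # ≈ 0.59, matching the reported F1 of 59%
```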

Of 13 still-present regressions reported to project maintainers, 10 were independently confirmed and 8 fixed. Concrete regression examples include misclassification of NaN elements in Keras (keras#19814) and input type rejections in SciPy (scipy#19263) (Pradel, 24 Mar 2025).

5. Performance, Scalability, and Cost

The average per-PR resource utilization consists of approximately 9,440 LLM tokens (5,818 in prompts, 3,622 in outputs), incurring an average cost of $0.003 at GPT-4o-mini rates. The mean end-to-end runtime is 12.3 minutes per PR, i.e., 39.5% of the median GitHub CI time for a given PR. The runtime model is approximately linear in the number of tests: total time $\approx T + \alpha N_{\text{tests}}$, with $T$ for orchestration/build and $\alpha$ for per-test execution. A similar relation holds for token-based LLM cost ($\beta N_{\text{tests}}$).
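The linear runtime and cost models can be sketched as below. The coefficient values ($T$, $\alpha$, $\beta$) are illustrative placeholders, not constants measured in the paper:

```python
def runtime_minutes(n_tests: int, T: float = 5.0, alpha: float = 0.5) -> float:
    """total time ≈ T + alpha * N_tests.

    T: fixed orchestration/build overhead; alpha: per-test execution time.
    Both values here are made-up placeholders for illustration.
    """
    return T + alpha * n_tests

def llm_token_cost(n_tests: int, beta: int = 470) -> int:
    """Token usage scales roughly as beta * N_tests (beta is a placeholder)."""
    return beta * n_tests

# With 20 generated tests under these placeholder coefficients:
t = runtime_minutes(20)   # 5.0 + 0.5 * 20 = 15.0 minutes
```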

The cost/delay profile and automation make Testora practical for use immediately before or shortly after code merges, providing an early lint for regressions not covered by ad hoc or traditional CI-based tests (Pradel, 24 Mar 2025).

6. Limitations and Applicability

Testora is currently optimized for projects with clear public APIs; applicability to GUI and interactive workflows remains unproven. Very large diffs and sparse PR documentation challenge both test generation and classification fidelity. As with all LLM-based systems, hallucinations and misinterpretations persist, though multi-question prompting and downstream filtering (e.g., checks for test flakiness, exception handling, and independent fix detection via env_latest reruns) serve as mitigations. PRs with terse or misleading descriptions erode the effectiveness of the natural-language oracle: the classifier’s accuracy depends directly on the quality of developer-provided intent.

A plausible implication is that development process adaptations, such as enriching PR descriptions and commit messages, may further amplify the effectiveness of techniques exemplified by Testora.

7. Significance and Research Outlook

Testora demonstrates the first systematic conversion of natural-language PR artifacts into a test oracle for software regression detection. By embedding the developer’s explicitly stated intent into the test generation and classification workflow, it bridges the semantic gap often missed by differential testing. Its integration of natural-language processing and code analysis foreshadows a landscape where automated quality assurance is context-aware and intention-aligned, making regression testing both quantitatively more precise and qualitatively more relevant to software evolution (Pradel, 24 Mar 2025). Future extensions may address broader modalities of software change, larger-scale diffs, and application to interactive/API-ambiguous domains.
