- The paper introduces an automated methodology using an agentic LLM to generate property-based tests for identifying bugs in Python packages.
- It details a modular six-step cycle, built around the Hypothesis library, in which the agent analyzes a target, proposes and writes property-based tests, runs them, and reports bugs with actionable detail.
- Experimental evaluation on 100 widely-used Python packages shows cost-efficient detection, with 56% of generated bug reports judged to be valid bugs, including issues in key libraries such as NumPy and the AWS SDKs.
Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Introduction
The paper "Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem" introduces an automated methodology for detecting bugs in Python packages using an Agentic LLM-based approach. This method leverages property-based testing (PBT) frameworks that automatically generate inputs to validate whether software properties hold over all defined inputs, distinguishing itself from traditional example-based testing. The distinctive contribution of this paper is the development of an advanced agent that autonomously handles the entire process of bug detection, from code analysis to the generation of actionable bug reports.
Agent Design and Functionality
The agent is built on top of Anthropic's Claude Code, a coding agent, and is implemented as a natural-language prompt that directs Claude Code to test targets with Python's Hypothesis library. The agent operates in a modular six-step cycle:
- Analyze the Target: Identifies whether the testing target is a Python module, file, or function.
- Understand the Target: Utilizes documentation and code to infer function-specific properties.
- Propose Properties: Derives high-value properties and proposes tests.
- Write Tests: Translates proposed properties into Hypothesis property-based tests (a sketch of such a test follows this list).
- Execute and Triage Tests: Runs tests and evaluates validity based on outcomes.
- Report Bugs: Composes detailed bug reports for valid bugs, including potential patches.
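As a rough illustration of the "Write Tests" step, a test the agent might emit could resemble the sketch below; the round-trip property on the standard library's json module is an assumed example rather than one reported in the paper.

```python
# Illustrative sketch of an agent-written Hypothesis test: a round-trip
# property for json.dumps/json.loads. The target and property are assumptions
# chosen for clarity, not findings from the paper.
import json

from hypothesis import given, strategies as st

# JSON-representable values: scalars combined recursively into lists and dicts.
json_values = st.recursive(
    st.none() | st.booleans() | st.floats(allow_nan=False) | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=20,
)


@given(json_values)
def test_json_round_trip(value):
    # Property: decoding an encoded value yields an equal value.
    assert json.loads(json.dumps(value)) == value
```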
A key feature of the agent is that it reflects on failing tests before reporting them, which keeps false alarms to a minimum and ensures that reported property violations are backed by reproducible evidence.
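One plausible form of that reflection, sketched here as an assumption rather than the paper's actual procedure, is to replay the shrunk counterexample directly and only escalate failures that reproduce deterministically.

```python
# Hypothetical triage helper (not from the paper): replay a shrunk
# counterexample outside the Hypothesis run before writing a bug report.
def replay_counterexample(property_check, args):
    """Return True if the property violation reproduces deterministically."""
    try:
        property_check(*args)  # re-run the property on the minimal failing input
    except Exception:
        return True            # assertion error or crash reproduces: candidate bug
    return False               # no failure on replay: treat as flaky or invalid
```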
Experimental Evaluation
The evaluation selected a diverse corpus of 100 widely-used Python packages covering domains such as data processing and cloud computing. The experimental infrastructure used isolated environments with Hypothesis and each target's dependencies installed (one possible setup is sketched after the list below), and the testing used Claude Opus 4.1 as the underlying LLM.
- Sample Selection: Drew on systematically selected standard library components and top PyPI packages.
- Execution Metrics: A total runtime of approximately 137 hours was required, with over 2.21 billion tokens processed.
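The paper does not include its harness code, but the per-package isolation described above could be set up along the lines of the sketch below; the helper function, paths, and installed packages are assumptions made for illustration.

```python
# Hypothetical sketch of per-package isolation (not the paper's harness):
# build a fresh virtual environment and install Hypothesis plus the target
# package before handing the environment to the agent.
import subprocess
import venv
from pathlib import Path


def prepare_environment(package: str, root: Path) -> Path:
    env_dir = root / f"env-{package}"
    venv.EnvBuilder(with_pip=True, clear=True).create(env_dir)
    pip = env_dir / "bin" / "pip"  # on Windows: env_dir / "Scripts" / "pip.exe"
    subprocess.run([str(pip), "install", "hypothesis", package], check=True)
    return env_dir


# Example (placeholder paths): prepare_environment("numpy", Path("/tmp/pbt-runs"))
```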
Manual validation found that 56% of the generated bug reports described genuine bugs, and a substantial fraction of those were severe enough to report back to maintainers.
Results and Findings
The key results indicate a high success rate in identifying genuine bugs:
- Precision and Costs: At an approximate cost of $5.56 per bug report, the approach is cost-efficient for finding bugs autonomously.
- Reported Bugs: Approximately 81% of the high-priority bug reports were judged valid and worth reporting upstream, including issues in essential Python libraries such as NumPy and the AWS SDKs.
Furthermore, the agent autonomously produced bug reports, some of which were accepted and merged into project repositories, marking a notable step for automated software testing.
Discussion and Future Work
The paper demonstrates the utility of LLM-based PBT as a scalable method for software auditing. It suggests that future work could focus on reducing invalid reports through better comprehension of developer intent and on greater autonomy in deciding which failures constitute bugs. As LLMs improve, gains in the scalability and precision of property synthesis may further strengthen PBT.
The implications of this approach extend beyond autonomous bug discovery to potentially revealing software vulnerabilities. Future studies may also explore integrating this agentic framework into CI/CD pipelines to provide real-time, automated testing feedback in software development cycles.
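As one hypothetical shape for such an integration, agent-generated Hypothesis tests checked into a repository could run under a CI-specific settings profile; the profile name and limits below are assumptions, not details from the paper.

```python
# Hypothetical CI integration sketch: a Hypothesis settings profile that keeps
# agent-generated property tests bounded and stable inside a pipeline.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    max_examples=200,                              # bound runtime per test
    deadline=None,                                 # avoid flaky timing failures
    suppress_health_check=[HealthCheck.too_slow],  # tolerate slow targets
)
settings.load_profile("ci")  # e.g. from conftest.py when a CI environment variable is set
```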
Conclusion
This research establishes a practical framework for using LLMs in property-based testing to uncover software bugs, showing that AI-driven approaches can meaningfully bolster the robustness and reliability of widely-used software libraries. It opens the way for further work on automated software testing, with broad opportunities for refinement and application across software ecosystems.