- The paper presents PROPGEN, a system using LLMs to automatically generate and refine properties for scalable mobile app testing, identifying 25 unique bugs.
- It integrates functionality-guided exploration with natural language property synthesis and evidence-driven refinement to enhance execution correctness and testing coverage.
- Empirical evaluation shows over 90% property validity and significantly improved bug detection compared to traditional automated testing approaches.
Automated LLM-Based Property Generation for Mobile App Testing: An Expert Analysis
Context and Motivation
Functional bugs in mobile applications frequently manifest through non-crashing, incorrect behaviors rooted in specific user interaction sequences. Traditional testing regimes relying on manual tests or automated code coverage are limited in their detection scope due to the absence of explicit behavioral specifications and the oracle problem. Property-based testing (PBT) addresses these issues by verifying general behavioral properties—abstract specifications of intended functionalities—across numerous GUI states. However, the practical utility of PBT in real-world mobile app development is fundamentally constrained by the need for labor-intensive, manually generated properties. The paper "From Exploration to Specification: LLM-Based Property Generation for Mobile App Testing" (2604.13463) introduces PROPGEN, an end-to-end system leveraging LLMs for automated property construction, thereby enabling scalable, specification-driven testing of Android applications.
Technical Approach
PROPGEN synthesizes executable properties for property-based testing via three sequential stages:
Functionality-Guided Exploration and Behavioral Evidence Construction
PROPGEN performs hypothesis-driven exploration by systematically inferring actionable functionalities from GUI states, grounding them in explicit UI widget references. Multimodal LLMs are triggered only when new UI contexts are encountered to infer functionality hypotheses, minimizing token consumption and redundancy. Targeted execution of these functionalities prioritizes main app behaviors, uniqueness, and contextual feasibility. Each execution trace is incrementally summarized into structured behavioral evidence containing state snapshots, interaction descriptors, state diffs, and outcome labels. This behavioral abstraction forms the basis for downstream property synthesis.
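To make the evidence structure concrete, the following is a minimal sketch of how one trace could be summarized into snapshots, interaction descriptors, state diffs, and an outcome label. The names `Step`, `BehavioralEvidence`, and `GuiState`, the `"tap:widget_id"` action format, and the flat widget-to-text state model are illustrative assumptions; the summary does not give PROPGEN's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumption: a GUI state is modeled as a flat map from widget id to visible text.
GuiState = Dict[str, str]


@dataclass
class Step:
    action: str        # interaction descriptor, e.g. "tap:add_button" (hypothetical format)
    before: GuiState   # state snapshot before the action
    after: GuiState    # state snapshot after the action

    def diff(self) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
        """Return widgets that were added, removed, or changed by this step."""
        keys = sorted(set(self.before) | set(self.after))
        return {
            k: (self.before.get(k), self.after.get(k))
            for k in keys
            if self.before.get(k) != self.after.get(k)
        }


@dataclass
class BehavioralEvidence:
    functionality: str                      # the hypothesis being exercised
    steps: List[Step] = field(default_factory=list)
    outcome: str = "unknown"                # outcome label, set once the trace ends

    def record(self, action: str, before: GuiState, after: GuiState) -> None:
        """Append one step incrementally, copying the snapshots."""
        self.steps.append(Step(action, dict(before), dict(after)))

    def summary(self) -> List[dict]:
        """Compact view of the trace: each action with the diff it caused."""
        return [{"action": s.action, "diff": s.diff()} for s in self.steps]


# Toy trace for a hypothetical note-taking app.
evidence = BehavioralEvidence("create a note")
s0 = {"title": "Notes", "count": "2"}
s1 = {"title": "New note", "count": "2", "title_field": ""}
s2 = {"title": "New note", "count": "2", "title_field": "Groceries"}
evidence.record("tap:add_button", s0, s1)
evidence.record("type:title_field", s1, s2)
evidence.outcome = "success"
```

Storing diffs rather than full screens keeps the summary small, which is one plausible way to limit the context passed to later synthesis steps.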
Property Synthesis from Behavioral Evidence
From summarized traces, PROPGEN elicits natural-language property specifications reflecting generalized (precondition, interaction, postcondition) relations. Critically, preconditions are grounded in observable UI evidence; interactions capture user event sequences; postconditions assert GUI effects verifiable across runtime states while filtering brittle, trace-specific details. The natural language specifications are then translated into executable properties compatible with property-based testing frameworks such as KEA, operationalizing behavioral rules for automated testing.
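The (precondition, interaction, postcondition) shape can be illustrated with a small, self-contained sketch. It does not use KEA's API; `Property`, `check`, and the toy `add_note` function with its seeded bug are hypothetical stand-ins for an executable property and a device under test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a GUI state is a plain dictionary of observable values.
GuiState = Dict[str, object]


@dataclass
class Property:
    name: str
    precondition: Callable[[GuiState], bool]             # when the property applies
    interaction: Callable[[GuiState], GuiState]          # user event sequence
    postcondition: Callable[[GuiState, GuiState], bool]  # expected GUI effect


def check(prop: Property, states: List[GuiState]) -> List[GuiState]:
    """Return the start states where the precondition held but the postcondition failed."""
    violations = []
    for state in states:
        if not prop.precondition(state):
            continue  # property does not apply in this state
        after = prop.interaction(dict(state))
        if not prop.postcondition(state, after):
            violations.append(state)
    return violations


def add_note(state: GuiState) -> GuiState:
    """Toy app behavior with a seeded bug: the note is dropped when the list holds 3 notes."""
    new_state = dict(state)
    if state["note_count"] != 3:
        new_state["note_count"] = state["note_count"] + 1
    return new_state


prop = Property(
    name="adding a note increases the list length by one",
    precondition=lambda s: s["screen"] == "note_list",
    interaction=add_note,
    postcondition=lambda before, after: after["note_count"] == before["note_count"] + 1,
)

states = [{"screen": "note_list", "note_count": n} for n in range(5)]
states.append({"screen": "settings", "note_count": 0})
violations = check(prop, states)
```

Note that the postcondition relates the count before and after the interaction instead of asserting a fixed value observed in one trace, which is the kind of generalization the paragraph above describes.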
Feedback-Driven Refinement of Imprecise Properties
Because properties inferred from single traces are prone to overgeneralization or excessive specificity, PROPGEN incorporates feedback-driven refinement. When testing triggers a false positive, the system diagnoses which property component (precondition, interaction, or postcondition) is at fault by comparing the original behavioral evidence against the failure trace. It then applies a targeted relaxation or strengthening so that the property holds across legitimate executions without departing from the original functional intent.
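The diagnosis step can be sketched as a comparison between the evidence trace and the failing trace. In PROPGEN this judgment is made by the system from behavioral evidence; the rule-based `diagnose` function below is only a hypothetical stand-in, and the `Trace` type and the `"kind:widget_id"` action format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Trace:
    start_widgets: Set[str]  # widget ids visible before the interaction
    actions: List[str]       # executed actions, e.g. "tap:add_button" (hypothetical format)


def targets(actions: List[str]) -> Set[str]:
    """Extract the widget ids that a list of actions operates on."""
    return {action.split(":", 1)[1] for action in actions}


def diagnose(evidence: Trace, failure: Trace) -> str:
    """Guess which property component to revise after a spurious violation."""
    # The precondition admitted a state that lacks a widget the interaction needs:
    # it is too weak and should be strengthened.
    if targets(evidence.actions) - failure.start_widgets:
        return "precondition"
    # The replayed event sequence diverged from the one that produced the evidence.
    if failure.actions != evidence.actions:
        return "interaction"
    # Same start context and same events, yet the assertion failed:
    # the postcondition likely encodes a trace-specific detail and should be relaxed.
    return "postcondition"


evidence = Trace({"add_button", "title_field"}, ["tap:add_button", "type:title_field"])

weak_pre = Trace({"title_field"}, ["tap:add_button", "type:title_field"])
diverged = Trace({"add_button", "title_field", "banner"}, ["tap:add_button"])
strict_post = Trace({"add_button", "title_field"}, ["tap:add_button", "type:title_field"])

results = [diagnose(evidence, t) for t in (weak_pre, diverged, strict_post)]
```

A real diagnosis would also need to separate spurious violations from genuine bugs, which this sketch does not attempt.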
Empirical Evaluation and Numerical Results
PROPGEN was evaluated on 12 popular Android apps, selected for diversity and for their use in prior functional bug studies. Using GPT-5.2 as the backend LLM and KEA for property execution, PROPGEN achieved:
- Functionality Coverage: Inferred 1,282 functionalities, of which 1,210 (94.4%) were valid and 977 (76.2%) were correctly executed. DroidAgent, by comparison, inferred 575 functionalities, of which 491 were valid and only 187 were correctly executed.
- Property Generation: Generated 985 properties, of which 912 (92.6%) were validated as semantically correct by expert annotators. Validity exceeded 90% for most apps.
- Refinement Effectiveness: Of 127 properties producing spurious violations during testing, 118 (92.9%) were successfully refined. Most modifications concerned preconditions or postconditions, not interaction sequences.
- Bug Detection: Uncovered 25 unique, previously undocumented functional bugs, of which only 3 (12%) were found by prior methods (Genie, Odin, PBFDroid, VisionDroid).
- LLM Cost: Each behavioral evidence construction and property synthesis call averaged 6,198 tokens ($0.0145), and the total cost per app averaged $14.86. Refinement required on average 8,768 tokens ($0.02) per property.
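The percentages above follow directly from the reported counts, and the short script below recomputes them. The DroidAgent rates and the cost per 1,000 tokens are derived here from the reported figures and are not stated in the summary; the helper names `pct` and `cost_per_1k` are illustrative.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def cost_per_1k(dollars: float, tokens: int) -> float:
    """Average cost in dollars per 1,000 tokens, rounded to four decimal places."""
    return round(1000 * dollars / tokens, 4)


# Counts reported for PROPGEN.
rates = {
    "valid_functionalities": pct(1210, 1282),
    "executed_functionalities": pct(977, 1282),
    "valid_properties": pct(912, 985),
    "refined_properties": pct(118, 127),
    "bugs_found_by_prior_tools": pct(3, 25),
}

# Derived from the reported DroidAgent counts (575 inferred, 491 valid, 187 executed).
droidagent_rates = {
    "valid_functionalities": pct(491, 575),
    "executed_functionalities": pct(187, 575),
}

# Derived unit costs for synthesis calls and refinement calls.
synthesis_cost = cost_per_1k(0.0145, 6198)
refinement_cost = cost_per_1k(0.02, 8768)

for name, value in rates.items():
    print(f"{name}: {value}%")
```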
Comparative Analysis and Claims
PROPGEN’s hypothesis-driven, widget-grounded exploration yields higher coverage and execution correctness than prior agent-based systems. Structured behavioral summarization supports high property fidelity, in contrast to the loosely associated functionality hypotheses of baseline approaches. The reported results (92.6% property validity and 25 bugs, 22 of which prior tools missed) support the claim that automated property generation can scale property-based testing and find bugs that existing automated and semi-automated testing methods miss.
The property refinement mechanism demonstrates that nearly all false positives arising from LLM-inferred properties can be mitigated through minimal, evidence-driven adjustments, highlighting the practical viability of automated specification construction.
PROPGEN's bug-finding capability is complementary to established functional testing tools: most bugs uncovered are outside their detection scope, attributable to a broader and more context-sensitive exploration enabled by generalized properties.
Implications and Future Directions
From a practical standpoint, PROPGEN reduces the human effort required for specification construction, promoting property-based testing adoption in continuous integration workflows. Theoretical implications extend to automated behavioral specification synthesis: the use of multimodal LLMs for contextual understanding, coupled with structured evidence abstraction, supports a scalable framework for property generation beyond mobile apps—potentially applicable to web GUI testing and automated documentation generation.
Future developments may focus on incremental learning across app versions, reinforcement-driven exploration, integration of richer multimodal evidence, and extending property synthesis to cross-app or multi-platform behaviors. Challenges remain in scaling to highly dynamic or deeply nested UI contexts and minimizing LLM hallucination in non-standard GUIs.
Conclusion
PROPGEN expands the practical reach of property-based testing for mobile apps by automating the exploration, specification, and refinement of behavioral properties, validated through strong empirical performance and robustness to false positives (2604.13463). This LLM-based pipeline provides a scalable framework for specification-driven testing, capable of uncovering complex functional bugs missed by traditional approaches and poised to shape future methodologies in automated software verification.