From Exploration to Specification: LLM-Based Property Generation for Mobile App Testing

Published 15 Apr 2026 in cs.SE | (2604.13463v1)

Abstract: Mobile apps often suffer from functional bugs that do not cause crashes but instead manifest as incorrect behaviors under specific user interactions. Such bugs are difficult to detect automatically because they often lack explicit test oracles. Property-based testing can effectively expose them by checking intended behavioral properties under diverse interactions. However, its use largely depends on manually written properties, whose construction is difficult and expensive, limiting its practical use for mobile apps. To address this limitation, we propose PropGen, an automated approach for generating properties for Android apps. However, this task is challenging for two reasons: app functionalities are often hard to systematically uncover and execute, and properties are difficult to derive accurately from observed behaviors. To this end, PropGen performs functionality-guided exploration to collect behavioral evidence from app executions, synthesizes properties from the collected evidence, and refines imprecise properties based on testing feedback. We implemented PropGen and evaluated it on 12 real-world Android apps. The results show that PropGen can effectively identify and execute valid app functionalities, generate valid properties, and repair most imprecise ones. Across all apps, PropGen identified 1,210 valid functionalities and correctly executed 977 of them, compared with 491 and 187 for the baseline. It generated 985 properties, 912 of which were valid, and repaired 118 of 127 imprecise ones exposed during testing. With the resulting properties, we found 25 previously unknown functional bugs in the latest versions of the subject apps, many of which were missed by existing functional testing techniques.


Authors (5)

Summary

The paper presents PROPGEN, a system using LLMs to automatically generate and refine properties for scalable mobile app testing, identifying 25 unique bugs.
It integrates functionality-guided exploration with natural language property synthesis and evidence-driven refinement to enhance execution correctness and testing coverage.
Empirical evaluation shows over 90% property validity and significantly improved bug detection compared to traditional automated testing approaches.

Automated LLM-Based Property Generation for Mobile App Testing: An Expert Analysis

Context and Motivation

Functional bugs in mobile applications frequently manifest as non-crashing, incorrect behaviors that appear only under specific user interaction sequences. Testing that relies on manual tests or on automated code coverage detects few of these bugs because it lacks explicit behavioral specifications, which is the oracle problem. Property-based testing (PBT) addresses this by checking general behavioral properties, that is, abstract specifications of intended functionality, across many GUI states. In practice, however, PBT for mobile apps is constrained by the need for labor-intensive, manually written properties. The paper "From Exploration to Specification: LLM-Based Property Generation for Mobile App Testing" (2604.13463) introduces PROPGEN, an end-to-end system that uses LLMs to construct properties automatically, enabling scalable, specification-driven testing of Android applications.

Technical Approach

PROPGEN synthesizes executable properties for property-based testing via three sequential stages:

Functionality-Guided Exploration and Behavioral Evidence Construction

PROPGEN performs hypothesis-driven exploration by systematically inferring actionable functionalities from GUI states, grounding them in explicit UI widget references. To limit token consumption and redundant queries, multimodal LLMs are invoked to infer functionality hypotheses only when a new UI context is encountered. Targeted execution then prioritizes functionalities by their relevance to main app behaviors, their uniqueness, and their contextual feasibility. Each execution trace is incrementally summarized into structured behavioral evidence containing state snapshots, interaction descriptors, state diffs, and outcome labels. This behavioral abstraction forms the basis for downstream property synthesis.
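The two ideas in this stage, querying the model only for unseen UI contexts and recording each execution as structured evidence, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `State` model (widget id mapped to visible text), `context_key`, `HypothesisCache`, `Evidence`, and `fake_infer` (a stand-in for the multimodal LLM call) are hypothetical names, and the paper's actual state abstraction and evidence schema may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

# Assumption: a GUI state is modeled as widget id -> visible text.
State = Dict[str, str]


def context_key(state: State) -> FrozenSet[str]:
    """Abstract a state to its set of widget ids (text content is ignored)."""
    return frozenset(state)


def state_diff(before: State, after: State) -> Dict[str, List[str]]:
    """Summarize which widgets appeared, disappeared, or changed text."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }


@dataclass
class Evidence:
    """One summarized execution trace (hypothetical schema)."""
    functionality: str
    before: State
    interactions: List[str]
    after: State
    diff: Dict[str, List[str]]
    outcome: str  # e.g. "success" or "failure"


@dataclass
class HypothesisCache:
    """Calls the (expensive) inference function once per distinct UI context."""
    infer: Callable[[State], List[str]]
    seen: Dict[FrozenSet[str], List[str]] = field(default_factory=dict)
    calls: int = 0

    def hypotheses(self, state: State) -> List[str]:
        key = context_key(state)
        if key not in self.seen:
            self.calls += 1
            self.seen[key] = self.infer(state)
        return self.seen[key]


def fake_infer(state: State) -> List[str]:
    # Stand-in for the LLM: propose tapping every button-like widget.
    return [f"tap {w}" for w in sorted(state) if w.startswith("btn_")]


cache = HypothesisCache(infer=fake_infer)
before = {"btn_add": "Add", "list_title": "Notes (0)"}
after = {"btn_add": "Add", "list_title": "Notes (1)", "note_0": "Buy milk"}

cache.hypotheses(before)
cache.hypotheses(dict(before))  # same context: served from the cache
cache.hypotheses(after)         # new widget set: triggers a second inference

evidence = Evidence(
    functionality="tap btn_add",
    before=before,
    interactions=["tap btn_add", "type 'Buy milk'", "tap save"],
    after=after,
    diff=state_diff(before, after),
    outcome="success",
)
print(cache.calls, evidence.diff)
```

Keying the cache on widget ids rather than on full text is one possible design choice: it treats "Notes (0)" and "Notes (1)" as the same screen layout, at the risk of missing contexts that differ only in content.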

Property Synthesis from Behavioral Evidence

From summarized traces, PROPGEN elicits natural-language property specifications reflecting generalized (precondition, interaction, postcondition) relations. Critically, preconditions are grounded in observable UI evidence; interactions capture user event sequences; postconditions assert GUI effects verifiable across runtime states while filtering brittle, trace-specific details. The natural language specifications are then translated into executable properties compatible with property-based testing frameworks such as KEA, operationalizing behavioral rules for automated testing.
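To make the (precondition, interaction, postcondition) shape concrete, the following is a minimal, framework-independent Python sketch. It does not use the KEA API; `Property`, `check`, and the toy `delete_first_note` app model (including its deliberately seeded bug) are hypothetical and exist only to show how such a triple can act as a test oracle across many states.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Property:
    name: str
    precondition: Callable[[State], bool]          # when the property applies
    interaction: Callable[[State], State]          # the user event sequence
    postcondition: Callable[[State, State], bool]  # expected GUI effect


def check(prop: Property, states: List[State]) -> List[State]:
    """Return states where the precondition held but the postcondition failed."""
    violations = []
    for state in states:
        if prop.precondition(state):
            after = prop.interaction(dict(state))  # run on a copy
            if not prop.postcondition(state, after):
                violations.append(state)
    return violations


def delete_first_note(state: State) -> State:
    """Toy app model with a seeded bug: deletion is ignored while a filter is on."""
    if not state["filter_active"]:
        state["note_count"] -= 1
    return state


delete_reduces_count = Property(
    name="deleting a note removes exactly one note",
    precondition=lambda s: s["screen"] == "list" and s["note_count"] > 0,
    interaction=delete_first_note,
    postcondition=lambda before, after: after["note_count"] == before["note_count"] - 1,
)

states = [
    {"screen": "list", "note_count": 2, "filter_active": False},
    {"screen": "list", "note_count": 0, "filter_active": False},      # precondition fails
    {"screen": "settings", "note_count": 3, "filter_active": False},  # precondition fails
    {"screen": "list", "note_count": 3, "filter_active": True},       # exposes the bug
]
violations = check(delete_reduces_count, states)
print(violations)
```

The postcondition compares counts relative to the pre-state rather than asserting a fixed value such as "2 notes remain", which mirrors the point about filtering out brittle, trace-specific details.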

Feedback-Driven Property Refinement

Because properties inferred from single traces are prone to overgeneralization or excessive specificity, PROPGEN incorporates feedback-driven refinement. When testing reports a false positive, the system diagnoses which property component (precondition, interaction, or postcondition) is at fault by comparing the original behavioral evidence against the failure trace. It then applies a targeted relaxation or strengthening so that the property holds across legitimate executions while preserving its original functional intent.
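The diagnose-then-adjust loop can be sketched in Python as follows. This is a toy heuristic standing in for the LLM-based diagnosis, not the paper's method: `Property`, `refine`, and `context_keys` are hypothetical, the failure trace is assumed to come from a legitimate execution (a false positive), and interaction faults are not modeled.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


@dataclass
class Property:
    pre: Dict[str, Callable[[State], bool]]          # named clauses, all must hold
    post: Dict[str, Callable[[State, State], bool]]  # named clauses, all must hold

    def applicable(self, state: State) -> bool:
        return all(clause(state) for clause in self.pre.values())

    def failed_post(self, before: State, after: State) -> List[str]:
        return [n for n, clause in self.post.items() if not clause(before, after)]


def refine(prop: Property, evidence_before: State, fail_before: State,
           fail_after: State, context_keys: List[str]) -> Tuple[str, Property]:
    """Toy diagnosis: blame the precondition if the failing run started in a
    different context than the evidence; otherwise drop the failing
    postcondition clauses as trace-specific."""
    for key in context_keys:
        expected = evidence_before.get(key)
        if fail_before.get(key) != expected:
            new_pre = dict(prop.pre)
            # Strengthen: require the context value seen in the evidence.
            new_pre[f"{key}=={expected!r}"] = lambda s, k=key, v=expected: s.get(k) == v
            return "precondition", replace(prop, pre=new_pre)
    failing = prop.failed_post(fail_before, fail_after)
    # Relax: note that dropping every clause would leave a vacuous property.
    new_post = {n: c for n, c in prop.post.items() if n not in failing}
    return "postcondition", replace(prop, post=new_post)


prop = Property(
    pre={"on_list": lambda s: s["screen"] == "list"},
    post={
        "editor_open": lambda b, a: a["screen"] == "editor",
        "title_is_note_1": lambda b, a: a.get("title") == "Note 1",  # brittle
    },
)
evidence_before = {"screen": "list", "read_only": False}

# Case A: the app was in read-only mode, so tapping "Add" legitimately did nothing.
fail_before_a = {"screen": "list", "read_only": True}
component_a, refined_a = refine(prop, evidence_before, fail_before_a,
                                dict(fail_before_a), ["read_only"])

# Case B: same context, but the new note received a different default title.
fail_after_b = {"screen": "editor", "read_only": False, "title": "Note 2"}
component_b, refined_b = refine(prop, evidence_before, evidence_before,
                                fail_after_b, ["read_only"])
print(component_a, component_b, list(refined_b.post))
```

In case A the refined property no longer applies in read-only mode; in case B it keeps the core "editor opens" clause and drops only the title check.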

Empirical Evaluation and Numerical Results

PROPGEN was evaluated on 12 popular Android apps, selected for their diversity and for their use in prior functional bug studies. Using GPT-5.2 as the backend LLM and KEA for property execution, PROPGEN achieved:

Functionality Coverage: Inferred 1,282 functionalities, of which 1,210 (94.4%) were valid and 977 (76.2%) were correctly executed. The baseline, DroidAgent, inferred 575 functionalities, of which 491 were valid and only 187 were correctly executed.
Property Generation: Generated 985 properties, 912 (92.6%) of which were validated as semantically correct by expert annotators. Property validity exceeded 90% for most apps.
Refinement Effectiveness: Of 127 properties producing spurious violations during testing, 118 (92.9%) were successfully refined. Most modifications concerned preconditions or postconditions, not interaction sequences.
Bug Detection: Uncovered 25 unique, previously undocumented functional bugs, of which only 3 (12%) were found by prior methods (Genie, Odin, PBFDroid, VisionDroid).
LLM Cost: Each behavioral evidence construction and property synthesis call averaged 6,198 tokens / $0.0145; total cost per app averaged $14.86. Refinement required on average 8,768 tokens / $0.02 per property.
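The percentages in the list above follow directly from the reported counts. The short Python check below recomputes them; the two DroidAgent rates are not stated in the text and are derived here from its raw counts, using the same denominator (inferred functionalities) as for PROPGEN.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rates = {
    "PROPGEN valid / inferred functionalities": pct(1210, 1282),
    "PROPGEN correctly executed / inferred": pct(977, 1282),
    "DroidAgent valid / inferred (derived)": pct(491, 575),
    "DroidAgent correctly executed / inferred (derived)": pct(187, 575),
    "valid / generated properties": pct(912, 985),
    "repaired / imprecise properties": pct(118, 127),
    "bugs also found by prior methods": pct(3, 25),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```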

Comparative Analysis and Claims

PROPGEN’s hypothesis-driven, widget-grounded exploration yields superior coverage and execution correctness compared to prior agent-based systems. The structured behavioral summarization ensures high property fidelity, as opposed to the loosely associated functionality hypotheses of baseline approaches. Strong numerical results substantiate the claim that automated property generation can substantially scale property-based testing, identifying bugs beyond the coverage of existing automated and semi-automated testing methods.

The property refinement mechanism demonstrates that nearly all false positives arising from LLM-inferred properties can be mitigated through minimal, evidence-driven adjustments, highlighting the practical viability of automated specification construction.

PROPGEN's bug-finding capability is complementary to established functional testing tools: 22 of the 25 bugs it uncovered were missed by those tools, which the analysis attributes to the broader and more context-sensitive exploration that generalized properties enable.

Implications and Future Directions

From a practical standpoint, PROPGEN reduces the human effort required for specification construction, promoting property-based testing adoption in continuous integration workflows. Theoretical implications extend to automated behavioral specification synthesis: the use of multimodal LLMs for contextual understanding, coupled with structured evidence abstraction, supports a scalable framework for property generation beyond mobile apps—potentially applicable to web GUI testing and automated documentation generation.

Future developments may focus on incremental learning across app versions, reinforcement-driven exploration, integration of richer multimodal evidence, and extending property synthesis to cross-app or multi-platform behaviors. Challenges remain in scaling to highly dynamic or deeply nested UI contexts and minimizing LLM hallucination in non-standard GUIs.

Conclusion

PROPGEN expands the practical reach of property-based testing for mobile apps by automating the exploration, specification, and refinement of behavioral properties, validated through strong empirical performance and robustness to false positives (2604.13463). This LLM-based pipeline provides a scalable framework for specification-driven testing, capable of uncovering complex functional bugs missed by traditional approaches and poised to shape future methodologies in automated software verification.
