Automated Functional Testing for Malleable Mobile Application Driven from User Intent

Published 2 Apr 2026 in cs.SE | (2604.02079v2)

Abstract: Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose ALADDIN, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that ALADDIN effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.


Authors (6)

Summary

The paper presents a novel framework, Aladdin, that automates functional testing for per-user malleable mobile apps using LLM-guided UI state navigation and dynamic assertion generation.
It efficiently tackles the challenge of navigating stateful UI environments and validating dynamic GUI modifications based on natural language user requirements.
Experimental results show significant improvements in test success rate, step efficiency, precision, and overall robustness compared to traditional testing methods.

Automated Functional Testing for Per-User Malleable Mobile Applications

Motivation and Context

Malleable software aims to support adaptation, configuration, and modification, even post-deployment, directly accommodating end-user requirements. Traditional application development enforces a top-down, developer-centric workflow, relying on product managers' prioritization for feature introduction and limiting per-user customization predominantly to configurable settings. This paradigm fails to address the "long tail" of user requirements, often leaving niche but important feature requests unfulfilled.

Recent advances in LLM-driven code generation have upended this landscape, making it feasible to envision a shift to end-user-driven application customization, especially in the mobile domain. However, this raises pressing requirements for robust, automated functional testing pipelines that can keep pace with high-volume, per-user feature augmentation, all without direct source code access granted to end users (Figure 1).

Figure 1: Motivation scenario and conceptual model for per-user malleable mobile application releases, preserving vendor IP while enabling fine-grained user customization.

Defining the Testing Problem

The critical challenge lies not only in automating reliable code generation for user-specified features but in fully validating their presence and correctness at scale. This problem has several dimensions:

Efficiently navigating stateful UI environments to reach feature-specific triggers based on natural language user requirements.
Validating the existence and correct triggering of the newly introduced GUI features.
Automating correctness oracles for high-level functional verification, particularly over semantic UI behaviors rather than explicit code or variable states.

Existing LLM-based test generation is largely confined to source-available, UI-free scenarios, while existing GUI testing relies on brittle heuristics or fixed exploratory policies that do not accommodate user-driven adaptation. Similarly, oracle generation methods in the literature focus on low-level property assertions (e.g., variables, logs), which are insufficient for GUI-centric requirements.

Aladdin Framework

The proposed framework, Aladdin, integrates natural language understanding, LLM-guided exploration, and dynamic assertion generation into a unified, requirement-driven GUI test generation pipeline (Figure 2).

Figure 2: Overview of the Aladdin architecture, highlighting iterative LLM-guided exploration, action script generation, and oracle construction.

Phase 1: LLM-Guided Navigation to the Trigger State

Given a requirement $r$, Aladdin initiates UI exploration from the application's initial state, incrementally executing ranked interaction actions. Each GUI state and operation is evaluated for semantic relevance to $r$, using an LLM-based scoring mechanism rather than traditional search or replay policies. The process leverages compressed DOM representations, LLM prompt engineering, and geometric mean relevance aggregation to efficiently guide the search toward potential trigger UI states.
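The ranked exploration described above can be illustrated with a small sketch. It is not the paper's implementation: the function names, the toy app graph, and the fixed scores standing in for LLM relevance judgements are all assumptions made for illustration. It shows how a geometric mean over per-step relevance scores can rank candidate paths in a best-first search.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple


def path_relevance(step_scores: List[float]) -> float:
    """Geometric mean of per-step relevance scores, each expected in (0, 1]."""
    if not step_scores:
        return 1.0  # assumption: the initial state has no steps, so it is neutral
    eps = 1e-9  # guards against log(0)
    log_sum = sum(math.log(max(s, eps)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def explore(
    start: str,
    actions: Callable[[str], Dict[str, str]],
    score: Callable[[str, str], float],
    is_trigger: Callable[[str], bool],
    max_steps: int = 20,
) -> Tuple[List[str], int]:
    """Best-first search that always expands the most relevant path so far."""
    # heapq is a min-heap, so relevance is negated; the counter breaks ties.
    frontier = [(-1.0, 0, start, [], [])]
    visited = set()
    counter = 1
    expanded = 0
    while frontier and expanded < max_steps:
        _, _, state, path, scores = heapq.heappop(frontier)
        if is_trigger(state):
            return path, expanded
        if state in visited:
            continue
        visited.add(state)
        expanded += 1
        for action, nxt in actions(state).items():
            new_scores = scores + [score(state, action)]
            entry = (-path_relevance(new_scores), counter, nxt, path + [action], new_scores)
            heapq.heappush(frontier, entry)
            counter += 1
    return [], expanded


# Hypothetical app graph and hypothetical relevance scores (stand-ins for LLM output).
APP = {
    "home": {"tap Settings": "settings", "tap Profile": "profile"},
    "settings": {"tap Language Settings": "language", "tap About": "about"},
    "profile": {},
    "language": {},
    "about": {},
}
SCORES = {
    "tap Settings": 0.8,
    "tap Profile": 0.2,
    "tap Language Settings": 0.95,
    "tap About": 0.1,
}

path, expanded = explore(
    "home",
    lambda s: APP[s],
    lambda s, a: SCORES[a],
    lambda s: s == "language",
)
print(path, expanded)
```

Using the geometric mean rather than a product keeps long paths comparable with short ones, while a single low-relevance step still pulls the aggregate down sharply.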

Notably, Aladdin employs a hybrid of literal and semantic equivalence guidelines, enabling it to prioritize path exploration by both high-confidence literal matches (e.g., an entry labeled "Language Settings" for a language-support requirement) and looser semantic containment for nuanced requests.
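One way to picture this hybrid of literal and semantic matching is the sketch below. The keyword extraction, the cap on semantic scores, and the `fake_llm_score` stub are assumptions for illustration; the source does not specify how the two guidelines are combined.

```python
import re
from typing import Callable, Iterable, Set

# Assumption: semantic scores are capped so a literal match always ranks first.
SEMANTIC_CAP = 0.9


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a UI label."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hybrid_relevance(
    keywords: Iterable[str],
    label: str,
    semantic_score: Callable[[str], float],
) -> float:
    """Literal keyword hit gives full confidence; otherwise fall back to a semantic score."""
    wanted = {k.lower() for k in keywords}
    if wanted & tokens(label):
        return 1.0
    return min(semantic_score(label), SEMANTIC_CAP)


def fake_llm_score(label: str) -> float:
    """Hypothetical stand-in for an LLM semantic-containment judgement."""
    return {"Display & Region": 0.7}.get(label, 0.05)


literal = hybrid_relevance(["language"], "Language Settings", fake_llm_score)
semantic = hybrid_relevance(["language"], "Display & Region", fake_llm_score)
unrelated = hybrid_relevance(["language"], "About", fake_llm_score)
print(literal, semantic, unrelated)
```

In this sketch the entry labeled "Language Settings" is explored first, a semantically related entry second, and unrelated entries last.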

Figure 3: Example of an LLM prompt format for trigger state judgement, compressing DOM attributes to minimize attention dispersion.
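The DOM compression mentioned in the caption above could look roughly like the following sketch. It assumes the UI hierarchy is available as an Android-style XML dump with `node` elements; the choice of which attributes to keep and which nodes to drop is an assumption, not the paper's exact rule.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumption: these attributes carry most of the meaning an LLM needs.
KEEP = ("text", "content-desc", "resource-id")


def compress_hierarchy(xml_dump: str) -> List[Dict[str, str]]:
    """Keep only labeled or clickable nodes, and only a few attributes of each."""
    root = ET.fromstring(xml_dump)
    compact = []
    for node in root.iter("node"):
        attrs = {k: node.get(k) for k in KEEP if node.get(k)}
        clickable = node.get("clickable") == "true"
        has_label = bool(attrs.get("text") or attrs.get("content-desc"))
        if not has_label and not clickable:
            continue  # layout containers and spacers add noise to the prompt
        if clickable:
            attrs["clickable"] = "true"
        compact.append(attrs)
    return compact


# Hypothetical dump of a settings screen.
SAMPLE = """
<hierarchy rotation="0">
  <node index="0" text="" resource-id="" class="android.widget.FrameLayout"
        content-desc="" clickable="false" bounds="[0,0][1080,1920]">
    <node index="0" text="Language Settings" resource-id="com.example:id/lang"
          class="android.widget.TextView" content-desc="" clickable="true"
          bounds="[0,100][1080,200]"/>
    <node index="1" text="" resource-id="com.example:id/spacer"
          class="android.view.View" content-desc="" clickable="false"
          bounds="[0,200][1080,210]"/>
  </node>
</hierarchy>
"""

compact = compress_hierarchy(SAMPLE)
print(compact)
```

Dropping bounds, class names, and unlabeled containers shortens the prompt, which is the effect the caption describes as minimizing attention dispersion.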

Phase 2: Functional Presence Check and Script Generation

Upon reaching a candidate trigger state, Aladdin constructs and runs a script to interact with all required UI entities, followed by the programmatic generation and evaluation of assertions that verify the presence of all necessary elements as specified in the requirement. All interaction and assertion code is synthesized via LLMs, using role-based prompting and domain-specific execution patterns (e.g., uiautomator2).

During this phase, dynamic refinement is applied to all UI selectors to mitigate hallucination and ambiguity from the LLM, ensuring robust matching between the test code and the live GUI.
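A minimal sketch of selector refinement follows, assuming the LLM proposes a text selector and the live screen exposes a list of visible labels. The fuzzy-matching approach via `difflib` and the cutoff value are assumptions chosen for illustration; the source does not describe the refinement algorithm in detail.

```python
import difflib
from typing import List, Optional


def refine_selector(
    proposed_text: str,
    live_texts: List[str],
    cutoff: float = 0.6,
) -> Optional[str]:
    """Map an LLM-proposed label onto a label that really exists on screen."""
    if proposed_text in live_texts:
        return proposed_text  # exact hit, nothing to repair
    lowered = {t.lower(): t for t in live_texts}
    matches = difflib.get_close_matches(
        proposed_text.lower(), list(lowered), n=1, cutoff=cutoff
    )
    # None signals that the selector is likely hallucinated and should be regenerated.
    return lowered[matches[0]] if matches else None


live = ["Language Settings", "About", "Theme"]
fixed = refine_selector("Language Setting", live)
missing = refine_selector("Export PDF", live)
print(fixed, missing)
```

With uiautomator2, the refined label could then be used in a text selector such as `d(text=fixed)`, where `d` is a connected device object; that step needs a real device and is omitted here.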

Phase 3: Functional Correctness Oracle Generation

Having activated the target feature, Aladdin automatically identifies pre- and post-execution states, then synthesizes oracles as collections of assertions designed to capture all possible correctness aspects of the augmented functionality. This includes comparison-based oracles (e.g., change in displayed value/language/option availability) and programmatic iteration over candidate UI states when necessary.
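A comparison-based oracle of the kind described above can be sketched as a diff between pre- and post-execution UI snapshots. The snapshot format (selector name mapped to displayed text), the function names, and the Bengali example values are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Optional, Tuple

State = Dict[str, str]


def diff_states(pre: State, post: State) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every element whose displayed value differs between the snapshots."""
    keys = sorted(set(pre) | set(post))
    return {k: (pre.get(k), post.get(k)) for k in keys if pre.get(k) != post.get(k)}


def check_oracle(
    pre: State,
    post: State,
    expected_changes: State,
    must_stay: Iterable[str] = (),
) -> List[str]:
    """Collect failed assertions; an empty list means the oracle passes."""
    failures = []
    changes = diff_states(pre, post)
    for key, expected in expected_changes.items():
        if post.get(key) != expected:
            failures.append(f"{key}: expected {expected!r}, got {post.get(key)!r}")
    for key in must_stay:
        if key in changes:
            failures.append(f"{key}: changed unexpectedly to {post.get(key)!r}")
    return failures


# Hypothetical snapshots for a "add Bengali language support" requirement.
pre = {"title": "Settings", "lang_option": "English", "version": "1.0"}
post_ok = {"title": "সেটিংস", "lang_option": "বাংলা", "version": "1.0"}
post_faulty = {"title": "Settings", "lang_option": "বাংলা", "version": "1.0"}

expected = {"title": "সেটিংস", "lang_option": "বাংলা"}
ok_failures = check_oracle(pre, post_ok, expected, must_stay=["version"])
faulty_failures = check_oracle(pre, post_faulty, expected, must_stay=["version"])
print(ok_failures, faulty_failures)
```

The faulty snapshot models a partial implementation in which the option is selectable but the interface text is not translated, so checking several facets at once is what exposes the fault.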

The framework is robust to partial or faulty code augmentation, generating multi-faceted oracles as needed to capture incomplete or subtle behaviors. Selector refinement ensures the reliability of all UI element references in generated oracles.

Figure 4: Illustrative example of Aladdin synthesizing a test for a user-specified Bengali language support requirement, from navigation through assertion construction and oracle validation.

Benchmark and Experimental Results

To systematically evaluate Aladdin, the authors curated a benchmark of 144 application versions across six mobile apps, each subjected to 34 diverse, real-world functional requirements sourced from Google Play reviews. The benchmark includes both correct and faulty implementations, which allows a rigorous assessment of discrimination capability. Aladdin achieves an 81.2% end-to-end success rate on the correct implementations, and 90.3% precision and 89.1% specificity once faulty implementations are included.

Aladdin was compared to stage-wise baselines, such as prompt-only navigation and AugmenTest for assertion/oracle generation, and showed consistently higher success rates, step efficiency, precision, recall, and specificity. In ablation settings, Aladdin exhibited substantial gains in both navigation efficiency (1.54 vs. 1.98 average steps to reach the trigger state) and oracle discrimination.
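For reference, the discrimination metrics named above follow the standard confusion-matrix definitions. The sketch below assumes that a "positive" is a generated test passing, so a true positive is a pass on a correct implementation and a false positive is a pass on a faulty one. The counts are invented for illustration and are not the paper's data.

```python
from typing import Dict


def discrimination_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard precision, recall, and specificity from confusion-matrix counts."""

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0  # avoid division by zero on empty classes

    return {
        "precision": ratio(tp, tp + fp),    # passes that were on correct implementations
        "recall": ratio(tp, tp + fn),       # correct implementations that passed
        "specificity": ratio(tn, tn + fp),  # faulty implementations that were rejected
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = discrimination_metrics(tp=8, fp=2, tn=18, fn=2)
print(m)
```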

The framework displayed cross-domain robustness, with recall and specificity sustained across general and domain-specific requirement types (e.g., Theme, Navigation, Health). Weaknesses were noted in categories with inherently ambiguous requirements (e.g., Creation, Notification), reflecting ongoing challenges in intent interpretation and oracle specification.

Practical and Theoretical Implications

Practically, Aladdin makes scalable, user-driven mobile app customization workable by reducing the testing bottleneck and removing the need for source access or developer intervention. This could substantially decrease mean-time-to-feature for niche user requests, enabling a new class of "malleable mobile apps" that preserve vendor IP control but deliver per-user binary releases.

On a theoretical level, the work operationalizes the concept of requirement-driven, semantically flexible UI test generation, integrating high-level NLP with practical system engineering. The demonstrated improvement in GUI navigation, interaction synthesis, and correctness oracle generation provides a template for the next generation of test frameworks suitable for the LLM-driven code generation era.

Prospects for Future Developments

Future AI research could address further sources of ambiguity in user requirements, leveraging more advanced LLMs for intent disambiguation and incorporating user feedback loops for oracle refinement. Extending Aladdin to other platforms (iOS, web) and integrating property-based or non-functional testing (e.g., security, performance) remain open challenges.

There is also substantial potential for integrating offline and online reinforcement learning, enabling the agent to self-improve oracle generation and state navigation policies based on empirical correctness data over time.

Conclusion

Aladdin provides a concrete methodology for automated, requirement-driven GUI testing in the context of per-user malleable mobile apps, bridging the gap between free-form end-user intent and robust, vendor-controlled software delivery. Experimental analysis validates its efficacy and generalization, marking a notable advance toward practical, LLM-enabled software customization ecosystems.
