Mapping Natural Language Instructions to Mobile UI Action Sequences

Published 7 May 2020 in cs.CL and cs.LG | arXiv:2005.03776v2

Abstract: We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PIXELHELP, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in HowTo instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PIXELHELP.

Citations (163)

Summary

  • The paper introduces the novel problem of translating natural language instructions into mobile UI action sequences and tackles it with a two-step Transformer-based approach.
  • It combines a phrase tuple extraction model that reaches 85.56% complete-match accuracy with a grounding model that achieves 70.59% complete-match accuracy on the PixelHelp dataset.
  • The research advances UI automation, with implications for accessibility, autonomous testing, and digital assistants through improved language grounding.

Mapping Natural Language Instructions to Mobile UI Action Sequences: An In-Depth Analysis

The paper "Mapping Natural Language Instructions to Mobile UI Action Sequences," authored by researchers from Google Research, presents a structured approach to connecting language instructions with actions on mobile interfaces. This exploration into language grounding proposes a multistep model that divides the challenge into two primary tasks: phrase tuple extraction and action grounding. The solution pivots on the implementation of Transformer-based models to handle the complexities of interpreting language and mapping it to executable UI sequences.

Key Contributions and Methodology

The paper introduces a novel problem statement that focuses on accurately translating natural language instructions into action sequences on mobile user interfaces (UIs). A key contribution is addressing this at scale without requiring laborious human annotation of instruction-action data pairs. In support of this objective, the authors present three new datasets:

  1. PixelHelp: Provides instructions paired with action sequences on a mobile UI emulator.
  2. AndroidHowTo: Comprises web-sourced English how-to instructions with annotated action phrase spans, used to train the phrase extraction model.
  3. RicoSCA: Contains synthetic command-action pairs derived from a large corpus of UI screens, used to train the grounding model (a toy version of this kind of synthesis is sketched after this list).
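
The following is a minimal, hypothetical sketch of how grounded command-action pairs might be synthesized from UI screen metadata in the spirit of RicoSCA. The templates, field names, and the synthesize_click_commands helper are illustrative assumptions, not the authors' actual generation pipeline.

```python
# Hypothetical sketch: synthesize grounded command-action pairs from UI screen
# metadata (in the spirit of RicoSCA). Templates and field names are assumed.
import random

CLICK_TEMPLATES = [
    "tap the {name} button",
    "click on {name}",
    "press {name}",
]

def synthesize_click_commands(screen_objects):
    """Generate (command, action) pairs for clickable, named objects on one screen."""
    pairs = []
    for obj in screen_objects:
        if not obj.get("clickable") or not obj.get("name"):
            continue
        command = random.choice(CLICK_TEMPLATES).format(name=obj["name"])
        action = {"type": "click", "object_id": obj["id"]}
        pairs.append((command, action))
    return pairs

# Toy screen with two objects, only one of which is clickable.
screen = [
    {"id": 0, "name": "settings", "clickable": True},
    {"id": 1, "name": "status bar", "clickable": False},
]
print(synthesize_click_commands(screen))
```

Because the commands are generated from the screen itself, each synthetic instruction comes with its ground-truth action for free, which is what lets the grounding model be trained without manual annotation.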

The endeavor to automate this translation involves two distinct yet interconnected components:

  • Phrase Tuple Extraction Model: Built on a Transformer architecture, this model identifies the tuples within an instruction that specify UI actions. Three span representations were evaluated, with sum pooling performing best (85.56% complete match accuracy in testing).
  • Grounding Model: Once tuples are extracted, this model connects them to executable actions on a given screen, contextually representing UI objects using both their content and screen position. With a complete match accuracy of 70.59% on the PixelHelp dataset, the Transformer-based grounding approach surpasses alternative baselines. A minimal sketch of both components follows this list.
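
As a rough illustration of the two components above, the sketch below sums contextual token encodings over a span (sum pooling) and scores UI objects, encoded from content plus normalized screen position, against the pooled phrase. The layer sizes, the dot-product scoring, and the UIObjectEncoder module are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the two-stage idea: sum-pooled span representations for
# phrase tuple extraction, and a grounding score that matches a phrase against
# UI objects encoded from content plus screen position. Sizes are assumptions.
import torch
import torch.nn as nn

def sum_pooled_span(token_encodings, start, end):
    """Represent a phrase span by summing its contextual token encodings."""
    return token_encodings[start:end + 1].sum(dim=0)

class UIObjectEncoder(nn.Module):
    """Encode each UI object from a content embedding and its screen position."""
    def __init__(self, content_dim=64, hidden_dim=64):
        super().__init__()
        # 4 positional features: normalized (left, top, right, bottom).
        self.proj = nn.Linear(content_dim + 4, hidden_dim)

    def forward(self, content_emb, bbox):
        return torch.relu(self.proj(torch.cat([content_emb, bbox], dim=-1)))

# Toy grounding step: pick the UI object whose encoding best matches the phrase.
torch.manual_seed(0)
tokens = torch.randn(10, 64)            # contextual encodings of 10 tokens
phrase = sum_pooled_span(tokens, 2, 4)  # e.g. the object-description span

encoder = UIObjectEncoder()
contents = torch.randn(5, 64)           # content embeddings of 5 UI objects
bboxes = torch.rand(5, 4)               # their normalized bounding boxes
objects = encoder(contents, bboxes)

scores = objects @ phrase               # dot-product grounding scores
print("grounded to object", scores.argmax().item())
```

In this toy setup the grounded object is simply the argmax of the scores; the actual model conditions on the full extracted action tuple and the screen context when predicting each step of the sequence.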

Results and Discussion

Decoupling language processing from action grounding is central to the performance gains, as it provides flexibility in handling diverse instructions and complex UI layouts. While heuristic and graph convolutional network (GCN) baselines were tested, the Transformer models consistently outperformed them, suggesting that attention better captures the intricate relationships between language and UI interactions.

The authors also analyze how spatial location descriptors in the language (e.g., "top", "bottom") correlate with UI object positions. The results confirm that such cues align with the expected screen regions, although variation remains because natural language is used flexibly. A toy version of this kind of check appears below.
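
A hypothetical sketch of that correlation check: for objects whose instructions mention a spatial word, test whether the object's normalized vertical center lies in the expected screen band. The bands, the sample data, and the matches_descriptor helper are illustrative assumptions, not the paper's analysis code.

```python
# Toy check: do objects described as "top"/"bottom" sit in the expected band?
# Thresholds and samples are assumptions for illustration only.
EXPECTED_BANDS = {"top": (0.0, 0.33), "bottom": (0.67, 1.0)}

def matches_descriptor(descriptor, y_center):
    low, high = EXPECTED_BANDS[descriptor]
    return low <= y_center <= high

# (descriptor mentioned in the instruction, normalized y-center of the object)
samples = [("top", 0.08), ("top", 0.21), ("bottom", 0.91), ("bottom", 0.45)]
agreement = sum(matches_descriptor(d, y) for d, y in samples) / len(samples)
print(f"descriptor/position agreement: {agreement:.0%}")  # 75% on this toy set
```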

Implications and Future Directions

The findings bear directly on accessibility technologies, autonomous UI testing, and personal digital assistants, where navigating mobile interfaces through spoken or typed instructions could transform the user experience. The work also lays a foundation for future research into language grounding and UI automation.

Possible extensions include refining the grounding model to mitigate errors that cascade from tuple extraction, and exploring reinforcement learning to improve adaptability to new applications and varied instruction styles. Integrating visual data could further improve action prediction.

Conclusion

This work marks a substantial step forward in the field of UI action automation from natural language instructions, providing a basis for future advancements in AI-driven mobile interface management. By addressing the challenges of phrase extraction and contextual grounding, this research invites continued exploration into innovative architectures and broader datasets for comprehensive language grounding solutions.
