Mapping Natural Language Instructions to Mobile UI Action Sequences: An In-Depth Analysis
The paper "Mapping Natural Language Instructions to Mobile UI Action Sequences," authored by researchers from Google Research, presents a structured approach to connecting language instructions with actions on mobile interfaces. This work in language grounding decomposes the challenge into two tasks, phrase tuple extraction and action grounding, and relies on Transformer-based models to interpret the language and map it to executable UI action sequences.
Key Contributions and Methodology
The paper introduces a novel problem statement that focuses on accurately translating natural language instructions into action sequences on mobile user interfaces (UIs). The contribution lies in addressing this at scale without necessitating laborious human annotation of instruction-action data pairs. In support of this objective, the authors present three new datasets:
- PixelHelp: Provides instructions paired with action sequences on a mobile UI emulator.
- AndroidHowTo: Comprises English how-to instructions collected from the web, with annotated action phrases for training the phrase extraction model.
- RicoSCA: Contains synthetic command-action pairings derived from a large corpus of UI screens to train the grounding model.
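To make the data concrete, a single instruction step can be thought of as a tuple pairing an operation with the object it acts on and any argument. The record layout below is a minimal illustrative sketch, not the datasets' actual schema; all field names are assumptions.

```python
# Hypothetical record for one instruction and its extracted step tuples.
# Field names are illustrative only, not the released datasets' format.
example = {
    "instruction": "Open Settings, then tap Wi-Fi and turn it on.",
    "steps": [
        # Each step tuple: (operation phrase, object description, argument).
        {"operation": "open", "object": "Settings", "argument": None},
        {"operation": "tap", "object": "Wi-Fi", "argument": None},
        {"operation": "turn on", "object": "it", "argument": None},
    ],
}

assert len(example["steps"]) == 3
```

Representing instructions this way is what lets the two models divide the labor: extraction produces the tuples, and grounding resolves each tuple against the current screen.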
The endeavor to automate this translation involves two distinct yet interconnected components:
- Phrase Tuple Extraction Model: Built on a Transformer architecture, this model identifies the tuples within an instruction that specify UI actions. Of the three span representations evaluated, sum pooling proved most effective (85.56% complete match accuracy on the test set).
- Grounding Model: Once tuples are extracted, this model connects them to the executable actions on a given screen. The model leverages the contextual representation of UI objects to enhance action prediction accuracy. With a complete match accuracy of 70.59% on the PixelHelp dataset, the Transformer-based grounding approach surpasses alternative baselines, demonstrating its robustness.
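The two components above can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's implementation: random vectors stand in for Transformer token and UI-object encodings, the span representation is the sum-pooling variant the paper found most effective, and grounding is reduced to a dot-product match followed by a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder output: one d-dimensional vector per instruction token.
# In the paper these come from a Transformer; random vectors stand in here.
d = 8
token_encodings = rng.normal(size=(10, d))  # 10 instruction tokens

def span_representation(encodings, start, end):
    """Sum-pooled representation of the token span [start, end)."""
    return encodings[start:end].sum(axis=0)

# Representation of a hypothetical object-description span (tokens 3..6).
phrase_vec = span_representation(token_encodings, 3, 6)

# Toy screen: one embedding per UI object currently on screen.
ui_objects = rng.normal(size=(5, d))  # 5 candidate objects

# Grounding as similarity matching: score each object against the
# extracted phrase and normalize scores with a softmax.
scores = ui_objects @ phrase_vec
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))  # index of the grounded UI object
```

In the actual system the UI-object embeddings are themselves contextual, produced by a Transformer over the screen's object attributes, which is what the dot product here merely gestures at.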
Results and Discussion
Separating language processing from action grounding proves important for performance, as it gives the system flexibility in handling diverse instructions and complex UI layouts. While heuristic and GCN-based baselines were tested, the Transformer models consistently outperformed them, suggesting that attention better captures the intricate relationships between language and UI interactions.
The authors also conducted an interesting analysis correlating spatial location descriptors in language with UI object positions. The results validate that certain linguistic cues align with expected screen areas, although variations exist due to the flexibility of natural language.
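The shape of that analysis can be illustrated with synthetic numbers: for instructions containing a location word, record the normalized vertical position of the referenced object and compare the per-word averages. The values below are invented for illustration, not the paper's measurements.

```python
import numpy as np

# Synthetic data: normalized vertical position (0 = top of screen,
# 1 = bottom) of UI objects referenced by each location descriptor.
positions = {
    "top":    [0.10, 0.15, 0.08, 0.20],
    "bottom": [0.85, 0.90, 0.78, 0.88],
}

means = {word: float(np.mean(vals)) for word, vals in positions.items()}

# Objects described as "top" should cluster near y = 0, "bottom" near y = 1.
assert means["top"] < 0.5 < means["bottom"]
```

The paper's finding is of this kind: the clusters align with the expected screen regions on average, with spread reflecting how loosely people use spatial language.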
Implications and Future Directions
The findings significantly impact accessibility technologies, autonomous UI testing, and personal digital assistant capabilities, where navigation of mobile interfaces through spoken or typed instructions could transform the user experience. Furthermore, this research lays foundational work for future explorations into language grounding and UI automation.
Possible extensions include refining the grounding model to mitigate errors that cascade from tuple extraction, and exploring reinforcement learning to improve adaptability to new applications and varied instruction styles. Incorporating visual screen data alongside structural UI features could further improve action prediction.
Conclusion
This work marks a substantial step forward in the field of UI action automation from natural language instructions, providing a basis for future advancements in AI-driven mobile interface management. By addressing the challenges of phrase extraction and contextual grounding, this research invites continued exploration into innovative architectures and broader datasets for comprehensive language grounding solutions.