
Enabling Conversational Interaction with Mobile UI using Large Language Models

Published 18 Sep 2022 in cs.HC and cs.AI | (2209.08655v2)

Abstract: Conversational agents show the promise to allow users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and effort-consuming. Recently, pre-trained LLMs have been shown capable of generalizing to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs. We experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enable language-based mobile interaction.

Citations (101)

Summary

  • The paper presents a novel approach leveraging LLMs to facilitate multi-scenario conversational interactions in mobile UIs.
  • It converts Android UI view hierarchies to HTML and uses exemplars with chain-of-thought prompting for effective task performance.
  • Evaluations show improved grammar correctness, screen summarization accuracy, and superior question-answering compared to baseline models.

Enabling Conversational Interaction with Mobile UI using LLMs

This overview discusses the application of LLMs to facilitate conversational interactions in mobile user interfaces (UIs), presenting an approach that bypasses the need for extensive datasets and specialized models for each task. By leveraging LLMs, a single model can adapt to a variety of tasks through effective prompting techniques.

Introduction

Conversational interactions with mobile devices, mediated through natural language, offer significant user advantages, potentially transforming UI manipulation for accessibility and multimodality. Existing solutions often require extensive resources for dataset creation and model training tailored to specific tasks. This work explores how LLMs, well-established in their ability to generalize across different tasks, can be adapted for mobile UIs with minimal task-specific input.

Approach

Conversation Scenarios

The paper defines four key conversational interaction scenarios between users and agents, each corresponding to a type of UI task:

  • Screen Question-Generation: The agent frames questions based on UI elements requiring input.
  • Screen Summarization: The agent provides a concise summary of the UI's function.
  • Screen Question-Answering (QA): The agent answers user queries by extracting information from the UI.
  • Mapping Instruction to UI Action: The agent interprets instructions to trigger appropriate UI elements.

    Figure 1: The categorization of different conversation scenarios when user and agent interact to complete tasks on mobile UIs.

Prompts and Screen Representation

To enable LLMs to process mobile UIs, the paper introduces techniques for converting Android UI view hierarchies into HTML syntax, leveraging the HTML structure to align with LLM training data characteristics. The prompts are augmented with exemplars, containing both inputs (screen HTML) and expected outputs (questions, summaries, answers, or actions), with Chain-of-Thought prompting used for complex reasoning.

Figure 2: Illustration of the proposed prompt structure and its application in mobile UI tasks.
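The view-hierarchy-to-HTML conversion can be sketched as follows. This is a simplified illustration, not the paper's exact mapping: the class-to-tag table and the `resource_id` attribute handling are assumptions made for this example.

```python
# Hypothetical mapping from Android view classes to HTML tags; the paper
# defines its own conversion rules, which may differ from these.
CLASS_TO_TAG = {
    "TextView": "p",
    "EditText": "input",
    "Button": "button",
    "ImageView": "img",
}

def node_to_html(node: dict) -> str:
    """Recursively render a view-hierarchy node as an HTML-like string."""
    tag = CLASS_TO_TAG.get(node.get("class", ""), "div")
    attrs = f' id="{node["resource_id"]}"' if node.get("resource_id") else ""
    text = node.get("text", "")
    children = "".join(node_to_html(c) for c in node.get("children", []))
    return f"<{tag}{attrs}>{text}{children}</{tag}>"

# A toy login screen expressed as a nested view hierarchy.
screen = {
    "class": "LinearLayout",
    "children": [
        {"class": "TextView", "text": "Sign in"},
        {"class": "EditText", "resource_id": "username"},
        {"class": "Button", "resource_id": "login", "text": "Log in"},
    ],
}
print(node_to_html(screen))
```

The resulting HTML string serves as the screen representation placed inside the prompt, trading some fidelity for a format the LLM has seen abundantly during pre-training.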

Task Implementation and Evaluation

Screen Question-Generation

The goal is to generate relevant questions for identified input fields. Experiments with LLMs achieved high grammar correctness and UI element relevance, outperforming a template-based baseline.

Figure 3: Example screen questions generated by the LLM, demonstrating the utilization of screen contexts for each UI element.
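The few-shot prompting behind question-generation can be sketched as below. The exemplar screens, questions, and prompt layout are invented for illustration and are not the paper's actual prompts.

```python
# Illustrative few-shot exemplars: (screen HTML, expected question) pairs.
EXEMPLARS = [
    ('<input id="email"></input>', "What is your email address?"),
    ('<input id="dob"></input>', "What is your date of birth?"),
]

def build_prompt(screen_html: str) -> str:
    """Assemble a few-shot prompt ending at the slot the LLM completes."""
    parts = [f"Screen: {html}\nQuestion: {q}" for html, q in EXEMPLARS]
    # The final entry leaves "Question:" open for the model to fill in.
    parts.append(f"Screen: {screen_html}\nQuestion:")
    return "\n\n".join(parts)

print(build_prompt('<input id="username"></input>'))
```

The LLM's continuation after the trailing `Question:` is taken as the generated question for the target input field.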

Screen Summarization

LLMs were tasked with summarizing screen functionality. Human evaluators judged LLM-generated summaries more accurate than those from benchmark models, even though automatic metric scores were slightly lower, suggesting that the LLM's prior knowledge contributes to summary quality.

Figure 4: Annotator vote distribution across all test screens, demonstrating higher perceived accuracy for LLM summaries.
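The Chain-of-Thought prompting mentioned earlier can be illustrated for summarization as follows: each exemplar interposes a short reasoning step between the screen and its summary. The exemplar content here is invented for illustration, not taken from the paper.

```python
# One illustrative chain-of-thought exemplar; real prompts would include several.
COT_EXEMPLAR = (
    'Screen: <input id="from"></input><input id="to"></input>'
    '<button id="search_flights">Search</button>\n'
    "Reasoning: The screen has origin and destination fields and a flight "
    "search button, so it is a flight search form.\n"
    "Summary: A screen for searching flights."
)

def build_summary_prompt(screen_html: str) -> str:
    """Append the target screen, leaving the reasoning step for the LLM."""
    return f"{COT_EXEMPLAR}\n\nScreen: {screen_html}\nReasoning:"

print(build_summary_prompt('<button id="play">Play</button>'))
```

The model is expected to continue with its own reasoning line followed by a `Summary:` line, from which the summary is extracted.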

Screen Question-Answering

In this task, LLMs significantly outperformed a pre-trained text QA model by understanding screen contexts and extracting accurate information based on questions.

Figure 5: Example results from the screen QA experiment showing the performance superiority of LLMs over baseline models.

Mapping Instruction to UI Action

The LLMs' ability to map language commands to UI actions was evaluated. While the LLM fell short of dedicated models trained on large datasets, it achieved competitive accuracy from only a handful of examples.
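A typical implementation detail for this task is turning the LLM's free-text prediction into a structured action. The output format assumed below (`action: <verb>, element: <id>`) is hypothetical, chosen for illustration; the paper's actual output format may differ.

```python
import re

def parse_action(llm_output: str) -> dict:
    """Parse an assumed 'action: <verb>, element: <id>' string into a dict."""
    m = re.search(r"action:\s*(\w+),\s*element:\s*([\w.]+)", llm_output)
    if not m:
        raise ValueError(f"unrecognized action: {llm_output!r}")
    return {"action": m.group(1), "element": m.group(2)}

print(parse_action("action: click, element: login"))
# {'action': 'click', 'element': 'login'}
```

The parsed dictionary can then be dispatched to the UI layer, e.g. by looking up the element id in the screen's view hierarchy and firing the corresponding event.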

Future Directions

The results suggest promising avenues for expanding conversational interactions, integrating multi-modal elements, and addressing limitations such as leveraging image data and multi-turn conversations. Future work may also focus on enhancing LLM steerability and refining prompt strategies to handle more complex scenarios.

Conclusion

The study demonstrates that LLMs, through strategic prompting and minimal exemplars, can effectively perform a variety of tasks within mobile UIs. This approach not only enables practical applications in HCI but also reduces the need for labor-intensive data preparation, paving the way for more accessible and efficient interface design and testing methodologies.


Authors (3)
