Enabling Conversational Interaction with Mobile UI using Large Language Models (2209.08655v2)

Published 18 Sep 2022 in cs.HC and cs.AI

Abstract: Conversational agents show the promise to allow users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and effort-consuming. Recently, pre-trained LLMs have been shown capable of generalizing to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs. We experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enable language-based mobile interaction.

Enabling Conversational Interaction with Mobile UI using LLMs

This paper explores the integration of LLMs into mobile user interfaces (UIs) to enable conversational interaction. It investigates whether a single pre-trained LLM can generalize across diverse UI tasks, offering a streamlined approach that removes the need for task-specific models and datasets.

Core Contributions and Approach

The authors present a method that leverages the few-shot learning capabilities of LLMs to interact with mobile UIs. They convert a UI's view hierarchy, the structural data describing the elements on screen, into a text representation using HTML syntax. Because LLMs are typically trained on a mixture of natural language and code, this representation lets them apply their existing understanding of language and markup to UI interaction tasks.
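As a rough illustration of this serialization step, the sketch below converts a simplified view-hierarchy node tree into an HTML-like string. The node fields, the widget-class-to-tag mapping, and the attribute choices are assumptions made for illustration, not the paper's exact encoding.

```python
# Minimal sketch: serializing a simplified Android view-hierarchy tree into an
# HTML-like string for an LLM prompt. Field names and the class-to-tag mapping
# are illustrative assumptions, not the paper's exact scheme.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UINode:
    widget_class: str                  # e.g. "android.widget.Button"
    text: Optional[str] = None         # visible label, if any
    resource_id: Optional[str] = None  # e.g. "com.app:id/email"
    children: List["UINode"] = field(default_factory=list)

# Assumed mapping from Android widget classes to HTML-like tags.
TAG_MAP = {
    "android.widget.Button": "button",
    "android.widget.EditText": "input",
    "android.widget.TextView": "p",
    "android.widget.ImageView": "img",
}

def to_html(node: UINode) -> str:
    """Recursively render a node and its children as an HTML-like string."""
    tag = TAG_MAP.get(node.widget_class, "div")
    rid = node.resource_id.split("/")[-1] if node.resource_id else ""
    attrs = f' id="{rid}"' if rid else ""
    inner = (node.text or "") + "".join(to_html(c) for c in node.children)
    return f"<{tag}{attrs}>{inner}</{tag}>"

screen = UINode("android.widget.LinearLayout", children=[
    UINode("android.widget.TextView", text="Sign in"),
    UINode("android.widget.EditText", resource_id="com.app:id/email"),
    UINode("android.widget.Button", text="Continue"),
])
print(to_html(screen))
# <div><p>Sign in</p><input id="email"></input><button>Continue</button></div>
```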

The paper categorizes conversational interactions into four primary scenarios based on initiative (user or agent) and purpose (providing or soliciting information); a prompt-construction sketch for one of these tasks follows the list:

  1. Agent Solicits Information: Exemplified by the task of Screen Question-Generation, where the agent autonomously crafts questions for input fields on the UI.
  2. Agent Provides Information: Represented by Screen Summarization, where the agent delivers a succinct overview of the UI's purpose.
  3. User Solicits Information: Screen Question-Answering demonstrates this category, where the agent responds to user inquiries about the UI.
  4. User Provides Information: Illustrated by Mapping Instruction to UI Action, where language commands translate into specific UI actions.
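
To make the prompting setup concrete, here is a minimal sketch of how a few-shot prompt for one of these tasks (Screen Summarization) might be assembled from HTML-like screen representations. The exemplar screens, summaries, and prompt wording are placeholders, not the prompts used in the paper.

```python
# Hedged sketch of assembling a few-shot prompt for screen summarization.
# Exemplars and wording are illustrative placeholders.

FEW_SHOT_EXEMPLARS = [
    ("<div><p>Sign in</p><input id=\"email\"></input><button>Continue</button></div>",
     "A login screen asking the user for their email address."),
    ("<div><p>Settings</p><button>Notifications</button><button>Privacy</button></div>",
     "A settings screen listing notification and privacy options."),
]

def build_summarization_prompt(target_screen_html: str) -> str:
    """Concatenate exemplar screen/summary pairs, then the target screen."""
    parts = []
    for screen_html, summary in FEW_SHOT_EXEMPLARS:
        parts.append(f"Screen: {screen_html}\nSummary: {summary}\n")
    parts.append(f"Screen: {target_screen_html}\nSummary:")
    return "\n".join(parts)

prompt = build_summarization_prompt(
    "<div><p>Weather</p><p>72F, sunny</p><button>Refresh</button></div>"
)
# The prompt is sent to an LLM completion endpoint; the continuation after
# "Summary:" is taken as the screen summary.
```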

The authors also employ chain-of-thought prompting, a technique in which the model generates intermediate reasoning steps before producing its final output, to draw on the LLM's semantic understanding of the screen and improve the quality of its responses.
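
The sketch below shows what a chain-of-thought exemplar could look like for screen question-generation, where each example spells out an intermediate reasoning step (identifying the input fields) before emitting the questions. The wording and format are illustrative assumptions, not the paper's exact prompt.

```python
# Hedged sketch of a chain-of-thought exemplar for screen question-generation.
# The exemplar first states its reasoning, then the questions it derives.

COT_EXEMPLAR = """\
Screen: <div><p>Book a flight</p><input id="origin"></input><input id="date"></input></div>
Reasoning: The screen has two input fields: one for the origin city and one for the travel date.
Questions: What city are you flying from? What date would you like to travel?
"""

def build_question_generation_prompt(target_screen_html: str) -> str:
    """Append the target screen after the exemplar and stop at 'Reasoning:'."""
    return COT_EXEMPLAR + f"\nScreen: {target_screen_html}\nReasoning:"

# The model's continuation contains both its reasoning and, after "Questions:",
# the generated questions for the new screen's input fields.
```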

Experimental Findings

Experiments were conducted on public Android UI datasets, including RICO and PixelHelp, evaluating the approach on each of the four conversational scenarios.

  • Screen Question-Generation: Achieved near-perfect grammatical correctness in generating questions, significantly outperforming template-based approaches.
  • Screen Summarization: Automatic metrics favored the baseline model, which was trained on the task's dataset, but human evaluators judged the LLM-generated summaries to be more accurate.
  • Screen QA: The LLM demonstrated strong performance, particularly with few-shot examples, showing a markedly better grasp of UI context than an off-the-shelf DistilBERT question-answering model.
  • Mapping Instruction to UI Action: Although it fell short of a model trained specifically for the task, the LLM achieved competitive metrics from only a handful of task-specific examples, demonstrating considerable potential; a sketch of how such predictions might be turned into actions follows this list.
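
For the instruction-to-action task, the sketch below shows one way a model's predicted element references might be resolved against the screen's elements to form executable actions. The comma-separated output format and the tap-action structure are assumptions for illustration, not the paper's prediction scheme.

```python
# Hedged sketch: turning an LLM's predicted element references into tap actions.
# Assumes the model answers with comma-separated element ids (e.g. "email,
# continue"); this output format is an illustrative assumption.

from typing import Dict, List

def parse_actions(llm_output: str, screen_elements: Dict[str, dict]) -> List[dict]:
    """Map comma-separated element ids predicted by the LLM to tap actions."""
    actions = []
    for element_id in (part.strip() for part in llm_output.split(",")):
        if element_id in screen_elements:
            actions.append({"type": "tap", "target": element_id})
    return actions

screen_elements = {
    "email": {"class": "android.widget.EditText"},
    "continue": {"class": "android.widget.Button"},
}
print(parse_actions("email, continue", screen_elements))
# [{'type': 'tap', 'target': 'email'}, {'type': 'tap', 'target': 'continue'}]
```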

Implications

This research suggests that LLMs are capable enough to support rapid prototyping of conversational interactions with mobile UIs. The approach can significantly reduce time and effort for interaction designers and developers by removing the need for large annotated datasets and costly model training.

The implications extend to accessibility: LLMs could improve interface usability for users who need non-visual interaction because of situational or permanent impairments. The work also demonstrates the applicability of LLMs beyond traditional language tasks, pointing toward their integration into other interactive systems, including web and desktop environments.

Speculation on Future AI Developments

Looking forward, advances in LLM capabilities could support more nuanced interactions, particularly complex multi-turn dialogues and multi-screen tasks. Fine-tuning such models or incorporating multi-modal inputs, combining visual data with textual cues, might further improve their utility in dynamic interface environments. This research lays foundational insights that could drive a shift toward increasingly intelligent and adaptive, AI-driven UI experiences.

In summary, this paper underscores the potential of LLMs to bridge natural language and UI interaction, offering a scalable way to enhance human-computer interaction through conversational agents on mobile platforms.

Authors (3)
  1. Bryan Wang (25 papers)
  2. Gang Li (579 papers)
  3. Yang Li (1140 papers)
Citations (101)