
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent (2505.16827v1)

Published 22 May 2025 in cs.AI

Abstract: GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions of structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at https://github.com/JiuTian-VL/GUI-explorer.

Summary

Overview of GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent

The paper introduces GUI-explorer, a training-free graphical user interface (GUI) agent designed to address two problems that multimodal large language models (MLLMs) face in dynamic environments: misinterpreting user interface (UI) components, and relying on knowledge that becomes outdated as applications are updated. Because the framework requires no parameter updates for new applications, it avoids the cost of fine-tuning each time an app changes.

Key Mechanisms

GUI-explorer integrates two essential mechanisms aimed at improving the capabilities of GUI agents:

  1. Autonomous Exploration of Function-aware Trajectory: This component uses a Function-aware Task Goal Generator to construct exploration goals from GUI structural information, such as screenshots and activity hierarchies. These goals drive a systematic traversal of the application's functionality and yield a diverse set of trajectories.
  2. Unsupervised Mining of Transition-aware Knowledge: This component uses a Transition-aware Knowledge Extractor, which performs unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). The analysis extracts screen-operation logic without human involvement, and the agent uses that logic to operate the application.
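The two mechanisms can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Transition`, `generate_exploration_goals`, and `mine_transition_knowledge` are hypothetical, screens are reduced to plain strings, and simple string templates and equality checks stand in for the MLLM-based analysis the paper describes. It shows only the data flow, from structural information to goals and from (observation, action, outcome) triples to mined knowledge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One interaction triple: (observation, action, outcome)."""
    observation: str  # screen state before the action (simplified to a label)
    action: str       # operation performed, e.g. "tap:Save"
    outcome: str      # screen state after the action


def generate_exploration_goals(activity_hierarchy):
    """Build one exploration goal per UI element.

    Assumption: the hierarchy is a {activity_name: [element_label, ...]} dict.
    A template replaces the model-generated goals used in the real system.
    """
    return [
        f"In {activity}, explore what '{element}' does"
        for activity, elements in activity_hierarchy.items()
        for element in elements
    ]


def mine_transition_knowledge(trajectories):
    """Collect, per action, the screen changes it was observed to cause.

    Steps whose outcome equals the observation are skipped, since they
    show no state transition to learn from. No labels are required.
    """
    knowledge = defaultdict(set)
    for trajectory in trajectories:
        for step in trajectory:
            if step.observation == step.outcome:
                continue
            knowledge[step.action].add(f"{step.observation} -> {step.outcome}")
    return {action: sorted(effects) for action, effects in knowledge.items()}


if __name__ == "__main__":
    goals = generate_exploration_goals({"NoteEditor": ["Save", "Share"]})
    trajectory = [
        Transition("editor", "tap:Save", "note_list"),
        Transition("editor", "tap:Blank", "editor"),
    ]
    print(goals)
    print(mine_transition_knowledge([trajectory]))
```

Keying the mined knowledge by action keeps it independent of any model weights, which is consistent with the training-free design: new knowledge for an app is added as data rather than through parameter updates.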

Results and Evaluation

The paper presents promising numerical results, demonstrating the effectiveness of GUI-explorer:

  • A task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, indicating significant improvements over state-of-the-art (SOTA) agents.
  • GUI-explorer enhances task performance by up to 11.7% compared to previous methods, showcasing its efficacy in diverse scenarios.

Additionally, the paper introduces a benchmark (GUI-KRB) to evaluate MLLMs' GUI understanding through 500 curated samples across 43 applications, revealing critical limitations in current models with error rates of 15.2% to 22.8%. GUI-explorer addresses these limitations, reducing prior knowledge inaccuracies by 16.0% and improving dynamic comprehension tasks.
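For readers unfamiliar with how such error rates are typically computed, the sketch below shows one common definition: the percentage of samples where a model's answer differs from the reference. The function name `error_rate` and the toy answers are hypothetical; the summary does not state GUI-KRB's exact scoring procedure, so this is an assumption rather than the benchmark's own metric.

```python
def error_rate(predictions, references):
    """Return the percentage of predictions that differ from the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sample is required")
    wrong = sum(pred != ref for pred, ref in zip(predictions, references))
    return 100.0 * wrong / len(references)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the benchmark.
    model_answers = ["opens menu", "saves note", "deletes note", "goes back"]
    reference_answers = ["opens menu", "saves note", "archives note", "goes back"]
    print(f"{error_rate(model_answers, reference_answers):.1f}%")
```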

Implications and Future Directions

The practical implications of this research are significant, as it demonstrates how GUI agents can adapt to dynamic environments with minimal human oversight. The autonomous exploration and knowledge extraction mechanisms offer a cost-effective and scalable solution for GUI automation, which is particularly advantageous given the rapid iteration cycles of mobile applications.

On the theoretical side, the work suggests that screen-operation logic can be recovered from collected trajectories and their state transitions, rather than from human-written documentation or fine-tuning. This approach may inform further work on agents that interact with digital interfaces autonomously.

Future research may explore extending these mechanisms to web and desktop applications, given the preliminary success observed in mobile environments. Investigations could also focus on enhancing the efficiency of the Transition-aware Knowledge Extractor and optimizing exploration techniques to further reduce computational overhead.

In conclusion, GUI-explorer is a training-free framework that targets two limitations of current MLLM-based agents, misinterpreted UI components and outdated knowledge, and reports higher task success rates than prior agents on SPA-Bench and AndroidWorld.