AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

Published 25 Sep 2024 in cs.AI | (2409.17140v2)

Abstract: Multimodal LLMs (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents' performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Microsoft Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and explores a fresh UI design principle for application providers to turn applications into agents in the era of LLMs, paving the way towards an agent-centric operating system (Agent OS).


Summary

The paper introduces AXIS, an API-first framework that converts applications into efficient agents while reducing latency compared to traditional UI interactions.
It reports experiments on Microsoft Word in which AXIS cuts task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining 97%-98% accuracy relative to human performance.
The framework’s self-explorative 'skill' mechanism enables agents to autonomously generate APIs, paving the way for an agent-centric operating system.

Essay: Transforming Application Interaction through AXIS

The paper "Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents" introduces a framework known as AXIS (Agent eXploring API for Skill integration). The framework targets human-agent-computer interaction (HACI): its LLM-based agents prefer API calls over UI actions so that tasks complete faster and more reliably.

The core problem addressed by this research is the high latency and low reliability of LLM-based agents that depend on long sequences of user interface (UI) interactions. Agents built on multimodal LLMs (MLLMs) can operate application UIs directly, which improves their performance on complex tasks, but executing a task one UI step at a time is slow and error-prone. AXIS works around this constraint by prioritizing application programming interfaces (APIs) over UI actions, which shortens task completion.

A fundamental element of AXIS is its capacity for self-exploration. The framework autonomously investigates application environments, learning and constructing new APIs to facilitate efficient interaction. This self-explorative aspect is manifest in what the authors refer to as "skills," high-level representations that empower API-first LLM-based agents, allowing significant reductions in task execution durations and cognitive loads for users.

The authors report that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, with an accuracy of 97%-98% compared to human performance, as observed in testing with Microsoft Word tasks. These results underline the framework's efficiency and its potential to change how people interact with software. Furthermore, the structured API-driven approach is presented as a way to turn any application into an agent, hinting at an evolution towards an "Agent OS" where user operations are predominantly mediated through natural language interfaces.

From a technical perspective, AXIS utilizes a structured approach to skill exploration within applications. It includes varied entities such as state interfaces and skill executors, which collectively facilitate interaction between agents and the environment. These agents not only observe but also interact, learning and generating new interaction skills that could eventually replace unnecessary UI sequences with single API calls.
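The API-first idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `Document`, `SkillRegistry`, `UI_PLANS`, `api_make_bold`, and `execute` are hypothetical names invented for this example, and the toy document stands in for a real application. The sketch only shows the dispatch logic: use a registered skill (one API call) when one exists, otherwise fall back to the step-by-step UI sequence.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Document:
    """Toy application state (hypothetical; not Word's object model)."""
    bold: bool = False
    log: List[str] = field(default_factory=list)


# Hypothetical UI-level steps: each models a single click or keystroke.
def ui_select_all(doc: Document) -> None:
    doc.log.append("ui:select_all")


def ui_open_font_menu(doc: Document) -> None:
    doc.log.append("ui:open_font_menu")


def ui_click_bold(doc: Document) -> None:
    doc.log.append("ui:click_bold")
    doc.bold = True


# Fallback plans: the UI sequence needed when no skill is available.
UI_PLANS: Dict[str, List[Callable[[Document], None]]] = {
    "make_bold": [ui_select_all, ui_open_font_menu, ui_click_bold],
}


class SkillRegistry:
    """Stores skills, i.e. single API-style calls keyed by task name."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[Document], None]] = {}

    def register(self, name: str, fn: Callable[[Document], None]) -> None:
        self._skills[name] = fn

    def get(self, name: str) -> Optional[Callable[[Document], None]]:
        return self._skills.get(name)


def api_make_bold(doc: Document) -> None:
    """Hypothetical skill: same end state as the UI plan, in one call."""
    doc.log.append("api:make_bold")
    doc.bold = True


def execute(task: str, doc: Document, skills: SkillRegistry) -> int:
    """Run a task API-first and return the number of actions performed."""
    skill = skills.get(task)
    if skill is not None:
        skill(doc)
        return 1
    steps = UI_PLANS[task]
    for step in steps:
        step(doc)
    return len(steps)


if __name__ == "__main__":
    skills = SkillRegistry()
    print(execute("make_bold", Document(), skills))  # UI fallback path
    skills.register("make_bold", api_make_bold)      # skill becomes available
    print(execute("make_bold", Document(), skills))  # API-first path
```

In this toy setup, registering the skill is what shortens the task from three UI actions to one call; how AXIS actually discovers, validates, and stores skills is described in the paper and is not modeled here.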

The implications for the field of AI and computer science are considerable. By shifting priority towards API-based interactions, AXIS contributes to reducing perceived complexity and cognitive strain in using software applications. There is a clear trajectory for practical application: as developers integrate AXIS-like frameworks, software design may shift to minimize UI complexity, focusing instead on robust API capabilities.

Looking further ahead, AXIS's approach could extend beyond individual applications. Its API-first design principles are not tied to a single product, which suggests they may adapt to other digital environments and could encourage a broader shift in how interfaces are developed. Such a shift may inspire future research that integrates and expands on frameworks of this kind, refining how efficient human-computer interaction mediated by AI agents is understood and implemented.

In conclusion, AXIS presents a compelling vision for future software interactions, characterized by reduced complexity and increased efficiency. By addressing the latency and reliability problems of existing LLM-based UI agents, AXIS offers a design that other applications and systems can follow, paving the way for agent-centric operating systems. While practical applications of this research are still emerging, the foundation laid by AXIS is a substantial step towards a more integrated, efficient, and user-centered digital future.