Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System (2506.08972v1)

Published 10 Jun 2025 in cs.CL

Abstract: Autonomous agents powered by multimodal LLMs have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks -- such as short-chain execution tasks and single-screen grounding tasks -- while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks to a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly sacrificing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io.

Summary

The paper introduces Agent-Nexus, a lightweight scheduling system that decomposes long-horizon tasks into self-contained atomic subtasks, improving the task success rate of existing mobile agents by 24% to 40% on UI-Nexus compositional tasks.
It presents the UI-Nexus benchmark, which evaluates mobile agents on compositional tasks across 50 applications: 20 fully controllable local utility apps and 30 online Chinese and English service apps.
The research highlights the potential of modular scheduling to advance hierarchical planning and enhance the real-world autonomy of mobile agents.

Atomic-to-Compositional Generalization for Mobile Agents

The paper "Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System" addresses the challenge of enabling mobile agents, powered by multimodal LLMs (MLLMs), to effectively transition from executing atomic tasks to more complex compositional tasks. Despite advancements in MLLMs facilitating autonomous operation on intelligent devices, agents often struggle with tasks requiring a combination of simpler operations—a necessity for real-world applications.

Benchmark: UI-Nexus

The authors introduce UI-Nexus, a benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-Nexus comprises 100 interactive task templates with an average optimal step count of 14.05, distributed across 50 applications: 20 fully controllable local utility apps and 30 online service apps in Chinese and English. The benchmark proves challenging for existing agents, which struggle to balance performance and efficiency and exhibit failure modes such as under-execution, over-execution, and attention drift.
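
To make the notion of an interactive task template concrete, the following is a minimal Python sketch of how such a template might be represented and instantiated. Only the three category names and the idea of an optimal step count come from the paper; the class name `TaskTemplate`, its fields, the placeholder syntax, and the example apps and instruction are assumptions made for illustration and do not reflect the actual UI-Nexus data format.

```python
from dataclasses import dataclass
from enum import Enum
from string import Template
from typing import Tuple


class Category(Enum):
    """The three compositional categories named in the paper."""
    SIMPLE_CONCATENATION = "Simple Concatenation"
    CONTEXT_TRANSITION = "Context Transition"
    DEEP_DIVE = "Deep Dive"


@dataclass(frozen=True)
class TaskTemplate:
    """Hypothetical template record; field names are illustrative only."""
    category: Category
    apps: Tuple[str, ...]   # apps the task touches (made-up names below)
    text: str               # instruction with $placeholders
    optimal_steps: int      # length of the shortest known solution

    def instantiate(self, **slots: str) -> str:
        """Fill the placeholders to obtain one concrete instruction."""
        return Template(self.text).substitute(**slots)


# Made-up example; the category assignment is chosen only for illustration.
template = TaskTemplate(
    category=Category.SIMPLE_CONCATENATION,
    apps=("Notes", "Calendar"),
    text="Create a note titled $title and then add an event named $title to the calendar",
    optimal_steps=12,
)

instruction = template.instantiate(title="Groceries")
```

Keeping the placeholder values separate from the template text is one plausible way to generate many concrete tasks from a small set of templates while holding the category and optimal step count fixed.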

Agent-Nexus Scheduling System

The paper proposes Agent-Nexus, a lightweight scheduling system aimed at improving how mobile agents handle such compositional tasks. Agent-Nexus dynamically breaks long-horizon tasks down into self-contained atomic subtasks, thereby reusing the capabilities of existing agents while raising their task success rate by 24% to 40% on the UI-Nexus benchmark without a significant increase in inference overhead.
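
The decomposition idea described above can be sketched as a simple scheduling loop. This is a minimal illustration under stated assumptions, not the Agent-Nexus implementation: the names `run_scheduler`, `SubtaskResult`, `plan_next`, and `execute` are hypothetical, and the toy planner and executor stand in for an MLLM-based planner and an existing mobile agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SubtaskResult:
    subtask: str
    success: bool
    note: str = ""  # information a later subtask may need


# Hypothetical interfaces: a planner proposes the next atomic subtask given the
# overall instruction and the history so far (None means "done"); an executor
# (an existing mobile agent) carries out one subtask.
Planner = Callable[[str, List[SubtaskResult]], Optional[str]]
Executor = Callable[[str], SubtaskResult]


def run_scheduler(instruction: str, plan_next: Planner, execute: Executor,
                  max_subtasks: int = 10) -> List[SubtaskResult]:
    """Alternate between planning one subtask and executing it."""
    history: List[SubtaskResult] = []
    for _ in range(max_subtasks):  # hard cap guards against over-execution
        subtask = plan_next(instruction, history)
        if subtask is None:        # planner judges the task complete
            break
        history.append(execute(subtask))
    return history


def toy_planner(instruction: str, history: List[SubtaskResult]) -> Optional[str]:
    """Stand-in planner: split on ' then ' and return the next unexecuted piece."""
    steps = [s.strip() for s in instruction.split(" then ")]
    return steps[len(history)] if len(history) < len(steps) else None


def toy_executor(subtask: str) -> SubtaskResult:
    """Stand-in agent that always reports success."""
    return SubtaskResult(subtask=subtask, success=True, note=f"done: {subtask}")


results = run_scheduler(
    "open the notes app then copy the title then paste it into a message",
    toy_planner,
    toy_executor,
)
```

Because the planner is consulted again after every subtask and sees the accumulated history, a real planner could adapt later subtasks to earlier outcomes; the toy planner here ignores that information and is static by construction.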

Experimental Results

Experiments on UI-Nexus showed that compositional operations pose substantial challenges for existing mobile agents. Specifically, agents typically struggled to balance performance and efficiency and were hampered by restricted access to internal mobile app states. Agent-Nexus significantly narrowed this atomic-to-compositional generalization gap, raising task success rates by 24% to 40% and underscoring the advantage of model-driven decomposition of long tasks into subtasks that existing mobile agents can execute.

Implications and Future Directions

The paper suggests practical implications for improving device autonomy, specifically regarding the integration of cognitive workflows into mobile functionality. It highlights theoretical implications for advancing hierarchical planning and reasoning in automated systems. Future developments in AI could benefit from incorporating modular scheduling strategies similar to Agent-Nexus, potentially refining mobile agents' operational capabilities in more dynamic environments.

Overall, the research emphasizes the importance of bridging the gap between atomic and compositional tasks in MLLM-powered agents, and it contributes both a benchmark for measuring that gap and a scheduling system for narrowing it. Researchers developing robust mobile agents can use UI-Nexus to evaluate compositional behavior and can treat Agent-Nexus as a starting point for decomposition-based scheduling on mobile platforms.