AppSelectBench: Application Selection Benchmark
- AppSelectBench is a large-scale benchmark that evaluates language-driven CUAs’ capacity to choose the correct desktop application given a natural language user intent.
- It comprises a dataset of over 100,000 tasks across 100 popular applications and evaluates selectors under zero-shot, few-shot, and retrieval-augmented regimes.
- The benchmark reveals performance variations across app categories, highlighting the need for hierarchical reasoning and modular strategies in future agent designs.
AppSelectBench is a large-scale, application-level benchmark designed to evaluate the capacity of language-driven Computer-Using Agents (CUAs) to select the correct desktop application given a natural-language user goal, prior to any invocation of fine-grained APIs. It systematically addresses the gap in tool-use evaluation frameworks left by benchmarks that exclusively assess intra-application API selection, without considering the fundamental decision of inter-application choice. AppSelectBench features a dataset of over 100,000 tasks spanning 100 widely-used desktop applications and a unified set of evaluation protocols, exposing current limitations and prospects in application-level reasoning for intelligent agents (Chen et al., 25 Nov 2025).
1. Motivation and Problem Statement
Application selection constitutes a fundamental step in human and agent workflows, determining which software environment should be initialized before granular tool usage. This operation is necessary to prevent orchestration errors, restrict irrelevant tool access, and optimize context relevance. Pre-existing benchmarks such as API-Bank and ToolBench freeze the application layer, evaluating only how agents invoke APIs within a single, preselected application. AppSelectBench reformulates the selection step: given a natural-language user intent $u$ and a candidate set of desktop applications $\mathcal{A}$, the benchmark defines the task as learning a function

$$f : (u, \mathcal{A}) \to G,$$

where $G$ is a directed graph over $\mathcal{A}$ encoding temporal or logical dependencies. In the studied singleton regime, this reduces to selecting a single correct application (or an interchangeable subset) matching the user's high-level intent.
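In the singleton regime, the selection function above can be sketched as picking the highest-scoring candidate under some relevance scorer. The sketch below is illustrative only; `select_application`, `score`, and the keyword-overlap scorer are hypothetical names, not part of the benchmark.

```python
# Hypothetical sketch of the singleton application-selection task:
# given a natural-language intent u and a candidate set A, return the
# single best-matching application.
from typing import Callable

def select_application(intent: str,
                       candidates: list[str],
                       score: Callable[[str, str], float]) -> str:
    """Pick the candidate application that best matches the intent.

    `score` is an assumed relevance function (e.g., LLM- or
    lexicon-based); the benchmark itself does not prescribe one.
    """
    return max(candidates, key=lambda app: score(intent, app))

# Toy usage with a trivial keyword-overlap scorer (illustrative only):
def keyword_overlap(intent: str, app: str) -> float:
    return float(app.lower() in intent.lower())

chosen = select_application("Open Excel and calculate total sales by region",
                            ["Word", "Excel", "PowerPoint"],
                            keyword_overlap)
# chosen == "Excel"
```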
2. Dataset Construction and Task Generation Pipeline
AppSelectBench’s dataset comprises more than 100,000 realistic and semantically grounded user instructions distributed across 100 desktop applications. The creation process follows a four-stage pipeline:
- Atomic Task Curation: Approximately 3,000 atomic operations are drafted using GPT models and human refinement, each defined with explicit argument schemas (e.g., Excel.SUM(column), PowerPoint.CreateFromTemplate).
- Workflow Composition: A composition engine generates multi-step workflows by sampling and chaining primitives under logical, temporal, and functional constraints (e.g., Excel.OpenFile → Excel.AddChart → PowerPoint.InsertChart).
- Argument Generation: Abstract arguments are instantiated through rule-based templates, probabilistic samplers, or small generative models, grounding each step with realistic parameters (e.g., generate_city_name(), generate_random_number()).
- Instruction Narration: Stepwise narrations are assembled, with selective omission ("step-wise dropout") to mimic natural conciseness, followed by paraphrasing with LLMs for fluency.
A human evaluation of a 10% stratified sample yielded high ratings: grammatical naturalness (4.7/5), semantic realism (4.6/5), and ground-truth correctness (99.8%), providing strong evidence of dataset quality.
3. Benchmark Scope and Application Coverage
AppSelectBench tasks span 100 applications categorized into 12 high-level domains, including browsers and search, office and knowledge work, communication, developer and sysadmin tools, creative and content production, music and media players, streaming and social video, gaming utilities, system utilities, and AI assistants. Each application contains on average approximately 1,000 user tasks. Example applications and categories include:
| Category | Example Applications |
|---|---|
| Office & Knowledge Work | Word, Excel, PowerPoint, OneNote |
| Developer & Sysadmin Tools | VS Code, RStudio, MATLAB, PowerShell |
| Creative & Content Production | Photoshop, Blender, CapCut |
| Communication & Collaboration | Teams, Slack, Zoom |
| Streaming & Social Video | YouTube, Netflix, TikTok |
| Gaming & Game Utilities | Steam, Solitaire |
| Windows Core Apps | File Explorer, Settings, Disk Cleanup |
| AI Assistants | M365 Copilot, Microsoft Copilot |
Tasks range in specificity and complexity, from "Calculate the total sales by region" (Excel) to "Search for 'Shape of You' and play it" (Spotify).
4. Evaluation Regimes and Metrics
AppSelectBench evaluates application selection using five principal regimes:
- Random Selector: Uniform random choice (empirical lower bound ≈1.6% accuracy).
- Rule-based Heuristic: Keyword-lexicon matching between user query and application names/functions.
- Zero-Shot Prompting: Only the user task is provided; the model predicts the application.
- Few-Shot Prompting: The prompt is prepended with 3–5 exemplars of (Task→Application) pairs.
- Retrieval-Augmented Selection (RAS): The model is supplied with structured textual capability descriptions of all candidate applications, retrieved from an external knowledge base.
Performance is assessed using:
- Accuracy:

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{a}_i \in \mathcal{A}_i^{*}\right],$$

where $\hat{a}_i$ is the predicted application and $\mathcal{A}_i^{*}$ is the set of annotated valid applications for task $i$.
- Category-Level Confusion: misclassifications are decomposed into intra-category error ($\mathrm{Err}_{\text{intra}}$), where the predicted application shares the gold application's category, and cross-category error ($\mathrm{Err}_{\text{cross}}$), where it does not.
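These two metrics can be computed jointly in one pass over the predictions. A minimal sketch, assuming per-application category labels are available as a lookup table (the function and argument names are illustrative):

```python
def evaluate(predictions: list[str],
             valid_sets: list[set[str]],
             gold: list[str],
             category_of: dict[str, str]) -> dict[str, float]:
    """Compute accuracy plus intra-/cross-category error rates.

    A prediction counts as correct if it lies in the annotated valid
    set; otherwise the error is intra-category when the predicted
    application shares the gold application's category, and
    cross-category when it does not.
    """
    n = len(predictions)
    correct = intra = cross = 0
    for pred, valid, g in zip(predictions, valid_sets, gold):
        if pred in valid:
            correct += 1
        elif category_of.get(pred) == category_of.get(g):
            intra += 1
        else:
            cross += 1
    return {"accuracy": correct / n,
            "err_intra": intra / n,
            "err_cross": cross / n}
```

By construction, `accuracy + err_intra + err_cross == 1`, so the two error terms fully decompose the failure mass.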
5. Experimental Setup and Results
Nine representative LLMs were evaluated with deterministic decoding. Closed-source models included GPT-5 and GPT-4o-mini; open-source models included Qwen-2.5-7B-Instruct, Qwen3-4B-Instruct-2507, Qwen3-30B-A3B-Instruct-2507, Llama-3-8B, Phi-4, Gemma-3-4B-pt, and Gemma-3-270M. All models received identical system prompts and candidate lists. Zero-shot and few-shot prompts differed only in whether exemplars were included; RAS appended approximately 100 lines of structured application descriptions.
Key quantitative results:
| Selector/Model | Zero-Shot | Few-Shot | RAS | Overall Avg |
|---|---|---|---|---|
| Random Selector | – | – | – | 1.6% |
| Rule-based Heuristic | – | – | – | 56.0% |
| GPT-5 | 62.0% | 63.5% | 64.4% | 63.3% |
| GPT-4o-mini | 60.3% | – | – | 60.3% |
| Qwen-2.5-7B-Instruct | 53.0% | 55.0% | 57.4% | – |
| Llama-3-8B | 54.2% | – | – | – |
| Phi-4 | 54.1% | – | – | – |
| Gemma-3-270M | 9.7% | – | – | – |
| Gemma-3-4B-pt | 37.6% | – | – | – |
Few-shot prompting yields consistent average gains (~2%), while RAS provides up to a 5% benefit for mid-scale models, indicating the value of explicit capability grounding for smaller models.
Category-level results highlight that Streaming & Social Video (62.3%) and Windows Core Apps (58.1%) yield higher accuracy, reflecting well-bounded functionalities. Gaming & Game Utilities (33.1%) and Music & Media Players (35.4%) exhibit the lowest scores due to the prevalence of near-synonymous tool sets and ambiguity. Misclassification analysis reveals that roughly three-quarters of model errors are cross-category rather than within-category, e.g., confusing file management with cloud storage applications. Per-application F1 scores also reflect tool-boundary ambiguity, spanning a wide range between applications such as Word and Notepad.
6. Significance, Limitations, and Future Directions
AppSelectBench establishes application-level selection as an explicit, quantifiable challenge for CUAs, exposing systematic strengths and deficiencies in contemporary agent architectures. Current state-of-the-art LLMs—even those with extensive world knowledge—exhibit substantial rates of inconsistent application selection, frequent cross-category errors, and difficulties in reasoning about multi-tool workflows. The benchmark enables comparative and ablation studies for both proprietary and open-source agents under unified protocols.
A plausible implication is that addressing these challenges will require not only larger or more data-rich models, but also architectural advances such as hierarchical selection strategies—first categorizing the intent, then disambiguating the application—or explicit modular reasoning components. Planned extensions include support for multi-application workflows (task graphs), deeper compositional reasoning, and further development of hierarchical and modular approaches toward robust, human-level application selection in CUAs.