AppSelectBench: Application Selection Benchmark
- AppSelectBench is a large-scale benchmark that evaluates language-driven CUAs’ capacity to choose the correct desktop application given a natural language user intent.
- It comprises a dataset of over 100,000 tasks across 100 popular applications and evaluates selectors under zero-shot, few-shot, and retrieval-augmented regimes.
- The benchmark reveals performance variations across app categories, highlighting the need for hierarchical reasoning and modular strategies in future agent designs.
AppSelectBench is a large-scale, application-level benchmark designed to evaluate the capacity of language-driven Computer-Using Agents (CUAs) to select the correct desktop application given a natural-language user goal, prior to any invocation of fine-grained APIs. It systematically addresses the gap in tool-use evaluation frameworks left by benchmarks that exclusively assess intra-application API selection, without considering the fundamental decision of inter-application choice. AppSelectBench features a dataset of over 100,000 tasks spanning 100 widely-used desktop applications and a unified set of evaluation protocols, exposing current limitations and prospects in application-level reasoning for intelligent agents (Chen et al., 25 Nov 2025).
1. Motivation and Problem Statement
Application selection constitutes a fundamental step in human and agent workflows, determining which software environment should be initialized before granular tool usage. This operation is necessary to prevent orchestration errors, restrict irrelevant tool access, and optimize context relevance. Pre-existing benchmarks such as API-Bank and ToolBench freeze the application layer, evaluating only how agents invoke APIs within a single, preselected application. AppSelectBench reformulates the selection step: given a natural-language user intent $u$ and a candidate set of desktop applications $\mathcal{A}$, the benchmark defines the task as learning a function

$$f : (u, \mathcal{A}) \to G,$$

where $G$ is a directed graph over $\mathcal{A}$ encoding temporal or logical dependencies. In the studied singleton regime, this reduces to selecting a single correct application (or an interchangeable subset) matching the user's high-level intent.
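In the singleton regime, the selection function above can be sketched as picking the highest-scoring candidate under some relevance scorer. The sketch below is illustrative only; `select_application`, `score`, and the keyword-overlap scorer are hypothetical names, not part of the benchmark.

```python
# Hypothetical sketch of the singleton application-selection task:
# given a natural-language intent u and a candidate set A, return the
# single best-matching application.
from typing import Callable

def select_application(intent: str,
                       candidates: list[str],
                       score: Callable[[str, str], float]) -> str:
    """Pick the candidate application that best matches the intent.

    `score` is an assumed relevance function (e.g., LLM- or
    lexicon-based); the benchmark itself does not prescribe one.
    """
    return max(candidates, key=lambda app: score(intent, app))

# Toy usage with a trivial keyword-overlap scorer (illustrative only):
def keyword_overlap(intent: str, app: str) -> float:
    return float(app.lower() in intent.lower())

chosen = select_application("Open Excel and calculate total sales by region",
                            ["Word", "Excel", "PowerPoint"],
                            keyword_overlap)
# chosen == "Excel"
```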
2. Dataset Construction and Task Generation Pipeline
AppSelectBench’s dataset comprises more than 100,000 realistic and semantically grounded user instructions distributed across 100 desktop applications. The creation process follows a four-stage pipeline:
- Atomic Task Curation: Approximately 3,000 atomic operations are drafted using GPT models and human refinement, each defined with explicit argument schemas (e.g., Excel.SUM(column), PowerPoint.CreateFromTemplate).
- Workflow Composition: A composition engine generates multi-step workflows by sampling and chaining primitives under logical, temporal, and functional constraints (e.g., Excel.OpenFile → Excel.AddChart → PowerPoint.InsertChart).
- Argument Generation: Abstract arguments are instantiated through rule-based templates, probabilistic samplers, or small generative models, grounding each step with realistic parameters (e.g., generate_city_name(), generate_random_number()).
- Instruction Narration: Stepwise narrations are assembled, with selective omission ("step-wise dropout") to mimic natural conciseness, followed by paraphrasing with LLMs for fluency.
A human evaluation of a 10% stratified sample yielded high ratings: grammatical naturalness (4.7/5), semantic realism (4.6/5), and ground-truth correctness (99.8%), providing strong evidence of dataset quality.
3. Benchmark Scope and Application Coverage
AppSelectBench tasks span 100 applications categorized into 12 high-level domains, including browsers and search, office and knowledge work, communication, developer and sysadmin tools, creative and content production, music and media players, streaming and social video, gaming utilities, system utilities, and AI assistants. Each application contains on average approximately 1,000 user tasks. Example applications and categories include:
| Category | Example Applications |
|---|---|
| Office & Knowledge Work | Word, Excel, PowerPoint, OneNote |
| Developer & Sysadmin Tools | VS Code, RStudio, MATLAB, PowerShell |
| Creative & Content Production | Photoshop, Blender, CapCut |
| Communication & Collaboration | Teams, Slack, Zoom |
| Streaming & Social Video | YouTube, Netflix, TikTok |
| Gaming & Game Utilities | Steam, Solitaire |
| Windows Core Apps | File Explorer, Settings, Disk Cleanup |
| AI Assistants | M365 Copilot, Microsoft Copilot |
Tasks range in specificity and complexity, from "Calculate the total sales by region" (Excel) to "Search for 'Shape of You' and play it" (Spotify).
4. Evaluation Regimes and Metrics
AppSelectBench evaluates application selection using five principal regimes:
- Random Selector: Uniform random choice (empirical lower bound ≈1.6% accuracy).
- Rule-based Heuristic: Keyword-lexicon matching between user query and application names/functions.
- Zero-Shot Prompting: Only the user task is provided; the model predicts the application.
- Few-Shot Prompting: The prompt is prepended with 3–5 exemplars of (Task→Application) pairs.
- Retrieval-Augmented Selection (RAS): The model is supplied with structured textual capability descriptions of all candidate applications, retrieved from an external knowledge base.
Performance is assessed using:
- Accuracy:

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{a}_i \in \mathcal{A}_i^{*}\right],$$

where $\hat{a}_i$ is the predicted application and $\mathcal{A}_i^{*}$ is the set of annotated valid applications for task $i$.
- Category-Level Confusion: misclassifications are decomposed into intra-category error ($\mathrm{Err}_{\text{intra}}$), where the predicted application shares the gold application's category, and cross-category error ($\mathrm{Err}_{\text{cross}}$), where it does not.
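These two metrics can be computed jointly in one pass over the predictions. A minimal sketch, assuming per-application category labels are available as a lookup table (the function and argument names are illustrative):

```python
def evaluate(predictions: list[str],
             valid_sets: list[set[str]],
             gold: list[str],
             category_of: dict[str, str]) -> dict[str, float]:
    """Compute accuracy plus intra-/cross-category error rates.

    A prediction counts as correct if it lies in the annotated valid
    set; otherwise the error is intra-category when the predicted
    application shares the gold application's category, and
    cross-category when it does not.
    """
    n = len(predictions)
    correct = intra = cross = 0
    for pred, valid, g in zip(predictions, valid_sets, gold):
        if pred in valid:
            correct += 1
        elif category_of.get(pred) == category_of.get(g):
            intra += 1
        else:
            cross += 1
    return {"accuracy": correct / n,
            "err_intra": intra / n,
            "err_cross": cross / n}
```

By construction, `accuracy + err_intra + err_cross == 1`, so the two error terms fully decompose the failure mass.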
5. Experimental Setup and Results
Nine representative LLMs were evaluated with deterministic decoding. Closed-source models included GPT-5 and GPT-4o-mini; open-source models included Qwen-2.5-7B-Instruct, Qwen3-4B-Instruct-2507, Qwen3-30B-A3B-Instruct-2507, Llama-3-8B, Phi-4, Gemma-3-4B-pt, and Gemma-3-270M. All models received identical system prompts and candidate lists. Zero-shot and few-shot prompts differed only in whether exemplars were included; RAS appended approximately 100 lines of structured application descriptions.
Key quantitative results:
| Selector/Model | Zero-Shot | Few-Shot | RAS | Overall Avg |
|---|---|---|---|---|
| Random Selector | – | – | – | 1.6% |
| Rule-based Heuristic | – | – | – | 56.0% |
| GPT-5 | 62.0% | 63.5% | 64.4% | 63.3% |
| GPT-4o-mini | 60.3% | – | – | 60.3% |
| Qwen-2.5-7B-Instruct | 53.0% | 55.0% | 57.4% | – |
| Llama-3-8B | 54.2% | – | – | – |
| Phi-4 | 54.1% | – | – | – |
| Gemma-3-270M | 9.7% | – | – | – |
| Gemma-3-4B-pt | 37.6% | – | – | – |
Few-shot prompting yields consistent average gains (~2%), while RAS provides up to a 5% benefit for mid-scale models, indicating the value of explicit capability grounding for smaller models.
Category-level results highlight that Streaming & Social Video (62.3%) and Windows Core Apps (58.1%) yield higher accuracy, reflecting well-bounded functionalities. Gaming & Game Utilities (33.1%) and Music & Media Players (35.4%) exhibit the lowest scores due to the prevalence of near-synonymous tool sets and ambiguity. Misclassification analysis reveals that roughly three-quarters of model errors are cross-category rather than within-category, e.g., confusing file management with cloud storage applications. Per-application F1 scores also reflect tool-boundary ambiguity, spanning a wide range between applications such as Word and Notepad.
6. Significance, Limitations, and Future Directions
AppSelectBench establishes application-level selection as an explicit, quantifiable challenge for CUAs, exposing systematic strengths and deficiencies in contemporary agent architectures. Current state-of-the-art LLMs—even those with extensive world knowledge—exhibit substantial rates of inconsistent application selection, frequent cross-category errors, and difficulties in reasoning about multi-tool workflows. The benchmark enables comparative and ablation studies for both proprietary and open-source agents under unified protocols.
A plausible implication is that addressing these challenges will require not only larger or more data-rich models, but also architectural advances such as hierarchical selection strategies—first categorizing the intent, then disambiguating the application—or explicit modular reasoning components. Planned extensions include support for multi-application workflows (task graphs), deeper compositional reasoning, and further development of hierarchical and modular approaches toward robust, human-level application selection in CUAs.