The Amazing Agent Race: Strong Tool Users, Weak Navigators

Published 11 Apr 2026 in cs.AI, cs.CL, and cs.LG | (2604.10261v1)

Abstract: Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race


Authors (5)

Summary

The paper reveals that leading LLMs achieve a maximum of 37.2% finish-line accuracy, largely due to navigation bottlenecks.
The paper introduces a novel DAG-structured benchmark with 1,400 legs and fork-merge tool chains to assess complex compositional reasoning.
The paper demonstrates that while tool-use accuracy remains relatively stable, navigation errors dominate, pinpointing a critical area for future improvements.

Motivation and Benchmark Contributions

Recent advances in LLM-based agents have centered on compositional tool use and web navigation. However, a systematic analysis of six leading benchmarks reveals that the majority (55–100%) of their instances are strictly linear, consisting only of shallow chains of two to five sequential steps. Such linearity under-represents the complexity of realistic scenarios involving fork-merge dependencies and multi-branch tool workflows. "The Amazing Agent Race: Strong Tool Users, Weak Navigators" (2604.10261) addresses this compositionality deficit by introducing The Amazing Agent Race (AAR), a benchmark of DAG-structured puzzles, or “legs”, with fork-merge tool chains, generated from Wikipedia and coupled with live-API validation.

Figure 1: (a) Existing benchmarks exhibit 55–100% linearity; AAR uniquely provides 0% linearity, with all instances as true DAGs. (b) Best agent accuracy is 36.6% (aggregated across 1,400 legs).
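To make the notion of linearity concrete, the following is a minimal sketch of how one might test whether a task graph is a simple chain and compute the linear fraction of a benchmark. The edge-list representation and the function names `is_linear_chain` and `linearity` are assumptions made for illustration; the source does not describe how the authors' analysis was implemented.

```python
def is_linear_chain(edges):
    """Return True if the directed edges form a single path with no fork or merge."""
    succ, has_parent, nodes = {}, set(), set()
    for src, dst in edges:
        # A second outgoing edge is a fork; a second incoming edge is a merge.
        if src in succ or dst in has_parent:
            return False
        succ[src] = dst
        has_parent.add(dst)
        nodes.update((src, dst))
    starts = [n for n in nodes if n not in has_parent]
    if len(starts) != 1:
        # An empty graph is treated as trivially linear (an assumption);
        # several start nodes or a pure cycle are not.
        return not nodes
    # Walk from the unique start node and check that every node is reached.
    seen, cur = 1, starts[0]
    while cur in succ:
        cur = succ[cur]
        seen += 1
    return seen == len(nodes)


def linearity(instances):
    """Fraction of instances (each an edge list) that are simple chains."""
    return sum(is_linear_chain(e) for e in instances) / len(instances)


chain = [("a", "b"), ("b", "c")]
diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(linearity([chain, diamond]))
```

Under this definition, a benchmark made up only of fork-merge diamonds would score 0% linearity, which is the property Figure 1(a) attributes to AAR.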

AAR comprises 1,400 evaluation legs split into two variants: AAR-Linear (sequential chains, 800 legs) and AAR-DAG (compositional fork-merge DAGs, 600 legs). Every leg is procedurally generated from random Wikipedia seeds, spans four difficulty levels, and passes a stringent, automated quality pipeline to ensure solvability, contamination resistance, and diverse tool composition. Notably, AAR legs average 22 pit stops and include up to five fork-merge diamonds, far surpassing the structural complexity of existing benchmarks.

Problem Formulation and Structural Features

In AAR, each instance presents the agent with a seed Wikipedia URL, a natural-language clue envelope that obfuscates explicit titles and tool names, the full set of 19 tools, and a constrained step budget. The agent must process the clues in sequence, navigate across Wikipedia, orchestrate multi-step tool chains (composing up to three calls), and aggregate the collected facts and computed values into a single-digit answer. Clues are paraphrased to minimize contamination risk and are coupled with live API calls (including time-varying data such as weather or financial values), so solving a leg requires more than recall of static knowledge.

Figure 2: An AAR clue envelope (leg) as presented to the agent, utilizing paraphrased multi-step riddles instead of explicit Wikipedia titles or tool names.

The legs are constructed as DAGs via diamond patterns: a source pit-stop (usually a Wikipedia-derived fact) fans out into two or more parallel tool chains, each invoking different API workflows, which then merge at a downstream aggregation node. This structure creates explicit compositional dependencies absent from standard linear tool-use tasks.
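The diamond pattern described above can be sketched as a small dependency graph evaluated in topological order. This is an illustrative toy, not the benchmark's implementation: the node names, the lambda "tools", the seed value, and the final modulo-10 reduction to a single digit are all assumptions. Only `graphlib.TopologicalSorter` (Python 3.9+) is a real standard-library API.

```python
from graphlib import TopologicalSorter


def run_leg(nodes):
    """Evaluate a leg given as: name -> (dependency names, function of their values)."""
    graph = {name: deps for name, (deps, _) in nodes.items()}
    values = {}
    # static_order() yields every node after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        deps, fn = nodes[name]
        values[name] = fn(*(values[d] for d in deps))
    return values


# Hypothetical diamond: one source fact fans out into two branches that merge.
diamond = {
    "source": ([], lambda: 1889),                    # stand-in for a fact read from a page
    "branch_a": (["source"], lambda y: y % 100),     # stand-in for tool chain A
    "branch_b": (["source"], lambda y: y // 100),    # stand-in for tool chain B
    "merge": (["branch_a", "branch_b"], lambda a, b: (a + b) % 10),  # aggregation node
}

print(run_leg(diamond)["merge"])  # 89 + 18 = 107, reduced to the digit 7
```

The point of the sketch is the dependency shape: the merge node cannot be computed until both branches finish, and an error in the source fact propagates into every downstream value.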

Evaluation Metrics and Pipeline Overview

AAR introduces metrics decomposed along the agent pipeline to allow fine-grained diagnosis:

Finish-line Accuracy (FA): Exact match on the final single-digit answer.
Pit-stop Visit Rate (PVR): Fraction of golden Wikipedia pages that the agent actually visits, an explicit measure of navigation quality.
Roadblock Completion Rate (RCR): Fraction of expected tool-chain invocations performed with the correct arguments.

This decomposition isolates navigation failures from tool-use deficiencies and final computation errors, enabling robust attribution of agent weaknesses.
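As a rough illustration of the three metrics for a single trial, here is a minimal sketch. The matching rules are assumptions: pages are compared as exact strings and tool calls as exact `(tool_name, args)` tuples, whereas the actual benchmark may normalize URLs or arguments differently. The function names and the example tool names are hypothetical.

```python
def finish_line_accuracy(predicted, gold):
    """1.0 if the final answer matches the gold answer exactly, else 0.0."""
    return float(str(predicted).strip() == str(gold).strip())


def pit_stop_visit_rate(visited_pages, gold_pages):
    """Fraction of golden pages that appear among the pages the agent visited."""
    gold = set(gold_pages)
    return len(gold & set(visited_pages)) / len(gold) if gold else 1.0


def roadblock_completion_rate(calls, gold_calls):
    """Fraction of expected (tool, args) invocations the agent made with the same args."""
    gold = set(gold_calls)
    return len(gold & set(calls)) / len(gold) if gold else 1.0


# Hypothetical trial: two of four golden pages visited, one of two tool calls correct.
visited = ["Page_A", "Page_B", "Page_X"]
gold_pages = ["Page_A", "Page_B", "Page_C", "Page_D"]
calls = [("weather", ("Paris",)), ("fx_rate", ("USD", "JPY"))]
gold_calls = [("weather", ("Paris",)), ("fx_rate", ("USD", "EUR"))]

print(finish_line_accuracy("7", 7))                  # 1.0
print(pit_stop_visit_rate(visited, gold_pages))      # 0.5
print(roadblock_completion_rate(calls, gold_calls))  # 0.5
```

Keeping the three scores separate is what allows a trial with a wrong answer to be traced to missed pages, wrong tool arguments, or a faulty final aggregation.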

Figure 4: Main results on both benchmark variants. (a) FA, PVR, and RCR for AAR-Linear. (b) Metrics for AAR-DAG reveal a sharp PVR drop (navigation difficulties) but relatively stable RCR (tool-use competence).

Benchmark generation proceeds through an automated eight-step pipeline whose stages include web crawling, thematic route planning, leg construction, tool-chain validation, diamond-structured augmentation, golden-trace execution, and NL clue verbalization. All legs pass alignment and contamination-resistance checks before inclusion in the evaluation set.

Agent Performance and Error Analysis

Evaluation across 1,400 legs and three agent frameworks (OpenAI Codex CLI, Anthropic Claude Code, and a lightweight mini-swe-agent ReAct-style agent) yields critical findings:

Low Maximum Achievement: No agent configuration exceeds 37.2% FA on any variant, with navigation (PVR) consistently underperforming tool-use competence (RCR).
Navigation as Bottleneck: Navigation errors are the dominant failure (27–52% of cases), while tool-use breakdowns remain below 17%, even as tool chains grow longer and more compositional.
Architectural Sensitivity: Agent framework choice impacts performance as much as model scale; e.g., Claude Code matches Codex CLI performance despite using 6× fewer tokens.
Compositionality Amplifies Navigation Failure: Transitioning from AAR-Linear to AAR-DAG induces a 13–18pp decline in navigation performance, while tool-use remains relatively unaffected.

Figure 6: FA (finish-line accuracy) degrades monotonically as difficulty increases, with worst-case drops of 19pp, consistent with the growing impact of navigation complexity.

A detailed failure taxonomy shows that agents often invoke tools correctly yet select or extract facts from the wrong pages. Furthermore, trials in which agents recover from partial progress show that high navigation and tool-use rates do not guarantee a correct answer: compositional aggregation remains an additional challenge.
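One way to turn per-trial metrics into a coarse failure attribution is sketched below. This is not the paper's taxonomy: the category labels, the priority order (navigation checked before tool use), and the threshold of 1.0 are all assumptions chosen for illustration.

```python
from collections import Counter


def attribute_outcome(fa, pvr, rcr, threshold=1.0):
    """Assign one coarse label to a trial from its FA, PVR, and RCR scores."""
    if fa == 1.0:
        # Correct answer; flag cases where golden pages were nevertheless missed.
        return "success_low_pvr" if pvr < threshold else "success"
    if pvr < threshold:
        return "navigation_error"   # missed at least one golden page
    if rcr < threshold:
        return "tool_use_error"     # pages found, but a tool call was missing or wrong
    return "aggregation_error"      # pages and tools fine, final answer still wrong


# Hypothetical trials given as (FA, PVR, RCR).
trials = [(0.0, 0.5, 1.0), (0.0, 1.0, 0.5), (0.0, 1.0, 1.0), (1.0, 0.4, 1.0), (1.0, 1.0, 1.0)]
print(Counter(attribute_outcome(*t) for t in trials))
```

The last failure branch corresponds to the observation above: a trial can reach full navigation and tool-use scores and still end with a wrong answer.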

Implications, Limitations, and Future Opportunities

AAR exposes limitations in current LLM agent architectures, specifically their inability to robustly solve complex, compositional, real-world tasks requiring nontrivial navigation and fact discovery. Simply increasing model size or internal “reasoning” (as in the case of GPT-OSS-120B) does not overcome the navigation bottleneck; it increases resource use without accuracy gains.

Shortcut solution strategies, where an agent bypasses navigation and infers tool arguments directly, are detectable through low PVR, and the cases where they succeed are nontrivial outcomes that AAR’s decomposed metrics can measure explicitly. However, even with these “shortcuts,” overall accuracy remains low, confirming the structural hardness of the benchmark.

Future research should address:

Targeted retrieval and verification over more exhaustive (and error-prone) searching.
Deeper integration of navigation and tool-use policies for complex DAGs with richer branching and more ambiguous clues.
Richer, domain-general compositionality (multi-season legs, cross-episode state, and more diverse sources than Wikipedia).

Conclusion

The AAR benchmark sets a new standard for evaluating tool-augmented LLM agents, demonstrating that strong tool-use alone does not translate into holistic problem-solving: navigation and compositional reasoning constitute the principal limitations. As LLM-based agents are increasingly deployed in open-ended and compositional real-world tasks, research must prioritize robust navigation, targeted multimodal fact-finding, and the fusion of navigation and tool utilization strategies. AAR provides the framework and diagnostics for such progress.
