
Generalized Agent Benchmark (GAIA)

Updated 19 April 2026
  • GAIA is a rigorously constructed evaluation suite that challenges language-model-based agents with complex, real-world tool-mediated tasks.
  • It features a diverse taxonomy including information retrieval, document analysis, code execution, and multi-step planning to test agent robustness.
  • Empirical results reveal significant human-agent performance gaps, underscoring the need for enhanced tool integration and robust agent architectures.

The Generalized Agent Benchmark (GAIA) is a rigorously constructed evaluation suite for measuring the capabilities of language-model-based agents performing complex, tool-mediated real-world tasks. GAIA is structured to isolate the challenges of open-ended reasoning, cross-domain robustness, tool use, long-horizon planning, and multimodal comprehension, providing an indispensable testbed for both academic research and system development in general AI assistant architectures (Mialon et al., 2023, Hofman et al., 21 May 2025, Liu et al., 1 Oct 2025, Żywot et al., 16 Jan 2026, Xie et al., 13 Aug 2025).

1. Objectives, Philosophy, and Core Structure

GAIA was developed in response to the observation that contemporary benchmarks for LLM-based agents predominantly target either (a) specialized professional or academic domains already being saturated by advanced models, or (b) synthetic, controlled settings which fail to capture the uncertainty and compositionality of realistic tasks. In contrast, GAIA targets everyday, errand-like scenarios: tasks that are conceptually simple for humans but operationally demanding for agents, settings where human users are robust and current agents are brittle (Mialon et al., 2023).

The guiding philosophy is to require multi-step, compositional reasoning that integrates the following fundamental abilities:

  • External tool invocation (e.g., code execution, browser automation, file parsing)
  • Robust information retrieval and web navigation
  • Multimodal understanding (images, spreadsheets, audio)
  • Closed-form answer production for real-world questions whose solutions are not extractable from plain text corpora

Benchmark questions are designed so that answers are short, unique, and unambiguous, and cannot be solved by simple memorization or templated prompts (Mialon et al., 2023).

2. Task Taxonomy, Dataset Composition, and Difficulty Levels

GAIA offers a taxonomy spanning several real-world agentic task categories (Hofman et al., 21 May 2025):

  • Information Retrieval: Tasks requiring web lookups, HTML/JSON parsing (“What is the population of Liechtenstein according to the CIA World Factbook?”)
  • Document Analysis: Tasks necessitating file operations, including regex or OCR on digital documents (“Given this PDF of the EU Digital COVID Certificate, extract the expiration date.”)
  • Code Interpretation/Execution: Procedural and computational tasks typically demanding isolated code evaluation in sandboxes (“Given this Python snippet, what does calling foo(5) print?”)
  • Simple Computation and Conversion: Use of calculators or unit-conversion libraries (“Convert 45 miles per hour to meters per second.”)
  • Planning and Multi-step Reasoning: Open-ended instructions that trigger chained API calls and cross-domain planning (“Plan a 3-day walking tour of central Kyoto that visits at least one shrine each day.”)

Questions are stratified into three difficulty levels reflecting multi-step complexity and tool composition:

  • Level 1: Single-step, ≤1 tool (e.g., basic lookups, 56–146 questions depending on split)
  • Level 2: Multi-step, multiple tools (e.g., code followed by table parsing; 53–245 questions)
  • Level 3: Arbitrarily long, complex tool orchestration, full browser control (omitted in some evaluations for tractability; 26–75+ questions) (Żywot et al., 16 Jan 2026, Xie et al., 13 Aug 2025)

Each question is tagged with associated capabilities and required domains, including web, document, code, and multimodality. The primary suite includes 165–466 annotated items for validation and leaderboard settings, with test set answers retained for public competition (Mialon et al., 2023, Liu et al., 1 Oct 2025, Hofman et al., 21 May 2025).
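
For concreteness, the sketch below shows how a single benchmark item of this kind might be represented in code. The field names, types, and the example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GaiaTask:
    """Illustrative record for one benchmark item (hypothetical field names)."""
    task_id: str
    question: str             # natural-language prompt posed to the agent
    final_answer: str         # short, unique, unambiguous gold string
    level: int                # 1, 2, or 3 (difficulty stratum)
    capabilities: List[str] = field(default_factory=list)  # e.g. ["web", "code"]
    attachment: Optional[str] = None  # path to a PDF, spreadsheet, image, or audio file

# Example item in the spirit of the Level-1 information-retrieval category
task = GaiaTask(
    task_id="demo-001",
    question="What is the population of Liechtenstein according to the CIA World Factbook?",
    final_answer="placeholder gold answer",  # not a verified reference value
    level=1,
    capabilities=["web"],
)
```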

3. Evaluation Protocols and Metrics

The canonical GAIA scoring metric is exact-match accuracy over reference strings: $\mathrm{Accuracy}^{\mathrm{GAIA}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\bigl(\hat y_i = y_i^{\text{ref}}\bigr)$, where $N$ is the number of tasks, $\hat y_i$ the agent answer, and $y_i^{\text{ref}}$ the gold reference (Hofman et al., 21 May 2025).
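
A minimal sketch of this scorer follows, assuming a light string normalization before comparison; the official scorer's normalization rules (articles, thousands separators, units) may differ.

```python
import re
from typing import List

def normalize(answer: str) -> str:
    # Light normalization before exact match; a sketch, not the official rules.
    answer = answer.strip().lower()
    return re.sub(r"\s+", " ", answer)

def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    # Accuracy = (1/N) * sum over i of 1[y_hat_i == y_ref_i]
    assert len(predictions) == len(references) and references
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_accuracy(["Paris ", "42"], ["paris", "41"]))  # 0.5
```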

For analyses involving multiple runs or stochasticity, the Pass@N metric is used: $\text{Pass@N} = 1 - (1 - p)^N$, where $p$ is the empirical per-task success probability (Liu et al., 1 Oct 2025, Żywot et al., 16 Jan 2026).

Aggregated scores may be computed as weighted or macro-averaged per-level accuracies. The regret gap (Pass@3 – Pass@1) quantifies the reliability of single-pass execution: a large gap indicates that an agent's results depend heavily on retries (Xie et al., 13 Aug 2025).
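
The following sketch estimates Pass@N empirically from repeated runs and derives the regret gap; the data layout (a list of per-run success flags per task) is an assumption for illustration.

```python
from typing import Dict, List

def pass_at_n(run_results: Dict[str, List[bool]], n: int) -> float:
    # A task counts as solved if any of its first n runs succeeded; average over tasks.
    # With per-task success probability p, this estimates 1 - (1 - p)^n.
    solved = [any(runs[:n]) for runs in run_results.values()]
    return sum(solved) / len(solved)

# Three independent runs per task (True = exact-match success)
runs = {
    "task-a": [True, True, False],
    "task-b": [False, False, True],
    "task-c": [False, False, False],
}
p1, p3 = pass_at_n(runs, 1), pass_at_n(runs, 3)
print(p1, p3, p3 - p1)  # Pass@1, Pass@3, and the regret gap (Pass@3 - Pass@1)
```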

In advanced evaluation, the Agent GPA (Goal–Plan–Action) framework decomposes agent error into:

  • Goal Fulfillment (does the outcome fully satisfy the user’s goal)
  • Logical Consistency (consistency and absence of hallucination)
  • Execution Efficiency (minimization of redundant or failed tool calls)
  • Plan Quality (correct, complete, non-redundant decomposition)
  • Plan Adherence (fidelity of execution to the initial plan)

Each trace is scored on a {0,1,2,3} rubric per metric with mean aggregation over the dataset (Jia et al., 9 Oct 2025).
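
As a sketch of how such rubric scores might be aggregated, the snippet below averages each Agent GPA dimension over a set of traces; the container and field names are hypothetical, not the framework's actual API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class GpaScores:
    """One trace's Agent GPA scores, each on the {0, 1, 2, 3} rubric (hypothetical names)."""
    goal_fulfillment: int
    logical_consistency: int
    execution_efficiency: int
    plan_quality: int
    plan_adherence: int

def aggregate(traces: List[GpaScores]) -> dict:
    # Mean aggregation over the dataset, per metric
    fields = ["goal_fulfillment", "logical_consistency", "execution_efficiency",
              "plan_quality", "plan_adherence"]
    return {f: mean(getattr(t, f) for t in traces) for f in fields}

print(aggregate([GpaScores(3, 3, 2, 3, 2), GpaScores(2, 3, 1, 2, 3)]))
```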

4. System Architectures and Agentic Frameworks

GAIA has catalyzed the development and assessment of a variety of agentic system architectures:

  • Collective Multi-Agent Ensembles: Systems such as JoyAgent–JDGenie fuse Plan–Execute (Supervisor) and ReAct (Worker) paradigms, deploying a critic voting model for output adjudication (Liu et al., 1 Oct 2025). Hierarchical memory (working, semantic, procedural) facilitates context management and knowledge integration.
  • Profile-Aware Supervision: AWorld leverages a dynamic MAS where an Execution Agent is supervised by a Guard Agent. The latter is informed by an offline “performance fingerprint” (category-wise error vector), enabling proactive, category-specific feedback analogous to feed-forward/feedback control in engineering (Xie et al., 13 Aug 2025).
  • Explicit Agentic-Reasoning: Configurations include ‘no-thinking’ (direct action), ‘planner-only’ (high-level plan followed by stepwise execution), and ‘full-thinking’ (chain-of-thought reasoning with tool calls). Tool augmentation (search, code, mind-map) provides the most consistent gains, especially for smaller models (Żywot et al., 16 Jan 2026).
  • Tool Suite Minimalism: Consistently, systems that focus on a concise set of robust tool APIs (e.g., search, code, multimodal parsers) achieve superior stability and scalability relative to tool-bloated alternatives (Liu et al., 1 Oct 2025).
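
The snippet below illustrates the tool-minimalism idea with a small registry and a single dispatch choke point; the tool names and stub behaviors are placeholders, not any evaluated system's actual API.

```python
from typing import Callable, Dict

# A deliberately small tool suite: a few robust, well-specified callables
# rather than a sprawling plugin catalogue.
TOOLS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register("search")
def search(query: str) -> str:
    return f"[stub] top results for: {query}"

@register("code")
def run_code(snippet: str) -> str:
    return "[stub] sandboxed execution output"

def dispatch(tool_name: str, argument: str) -> str:
    # Single choke point for tool calls; unknown tools fail loudly instead of silently.
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](argument)

print(dispatch("search", "population of Liechtenstein"))
```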

5. Empirical Results and Comparative Findings

Human baseline accuracy on GAIA reaches 92% overall, with Levels 1 and 2 at roughly 94% and 92% and Level 3 at 87%. In stark contrast, GPT-4 with plugins achieves only about 30% on Level 1, under 10% on Level 2, and 0% on Level 3 in single-pass settings, for roughly 13% overall (Mialon et al., 2023). For open-source models, significant advances are observed with effective tool use and multi-agent orchestration (Liu et al., 1 Oct 2025, Xie et al., 13 Aug 2025, Żywot et al., 16 Jan 2026):

Model/System               | Pass@1 (Overall) | Pass@1 (L1) | Pass@1 (L2) | Pass@1 (L3)
Human                      | 92%              | 94%         | 92%         | 87%
GPT-4 + plugins            | 13.3%            | 30.3%       | 9.7%        | 0%
JoyAgent–JDGenie Fusion    | 75.2%            | 86.8%       | 77.9%       | 42.3%
AWorld Profile-Aware MAS   | 70.95%           | –           | –           | –
Qwen3 4B (no tools)        | 6.06%            | –           | –           | –
Qwen3 4B (agentic NT)      | 13.33%           | –           | –           | –
Qwen3 32B (no tools)       | 9.70%            | –           | –           | –
Qwen3 32B (agentic NT)     | 25.45%           | –           | –           | –

Tool augmentation consistently yields the largest performance uplifts, with small, tool-augmented models outperforming much larger non-tool-using counterparts. Multi-agent and critic-ensemble architectures further raise both performance and stability (Liu et al., 1 Oct 2025, Xie et al., 13 Aug 2025, Żywot et al., 16 Jan 2026).

Profile-aware MAS reduces variance (std down >50%) and regret gap while boosting mean accuracy (Xie et al., 13 Aug 2025). Fusion of planning and acting, with hierarchically managed memory, substantially improves long-horizon and complex (Level 3) task accuracy (Liu et al., 1 Oct 2025).

6. Multilingual Extensions and Robustness

To address concerns of global accessibility and performance degradation in non-English contexts, GAIA was integrated as the real-world task component of the MAPS (Multilingual Agentic Performance and Security) benchmark (Hofman et al., 21 May 2025). The MAPS translation pipeline combines neural machine translation (NMT) with LLM-based semantic-integrity checks and manual spot-verification; 94.4% of translated items were judged answerable.

Empirically, GAIA tasks exhibit a significant “multilingual effect”: the English Pass@1 baseline of 78.2% ± 1.2% drops to a mean of 61.4% ± 2.3% in non-English settings (a 16.8-point gap), with the largest degradation in languages whose prompts contain the most non-English tokens. No systematic vulnerabilities beyond the performance drop were observed, though format corruption during translation caused rare failures, which were eliminated by human verification.

The MAPS analysis highlights that natural-language heavy, tool-enabled reasoning benchmarks such as GAIA require multilingual-tuned pretraining, code + English scaffolding in prompts, and augmented reference answers to mitigate exact-match brittleness (Hofman et al., 21 May 2025).
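
One simple way to realize the “augmented reference answers” recommendation is to match predictions against a curated set of per-task aliases, as sketched below; the normalization and the alias examples are assumptions, not part of the MAPS protocol.

```python
from typing import Iterable

def match_any(prediction: str, references: Iterable[str]) -> bool:
    # Exact match against a set of reference aliases (e.g. locale-dependent
    # number formats), softening exact-match brittleness across languages.
    # The alias sets themselves would have to be curated per task.
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) in {norm(r) for r in references}

print(match_any("3,5 km", ["3.5 km", "3,5 km"]))  # True
```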

7. Limitations, Recommendations, and Future Directions

GAIA, while representing a significant leap over prior benchmarks in realism and transferability, still omits certain real-world constraints, such as asynchronous execution, temporal deadlines, and explicit collaboration. The Gaia2 extension (within the ARE framework) introduces asynchronous simulation, environmental noise, ambiguity, temporal constraints, and agent-to-agent collaboration—surfacing novel failure modes and further stressing general agentic capabilities (Andrews et al., 21 Sep 2025).

Key recommendations for future benchmark evolution include:

  • Stratifying tasks by language-sensitivity and capability axis
  • Designing standardized, compositional, multilingual, and multimodal templates
  • Incorporating budget-scaling, latency, and accuracy frontiers for real-world deployment constraints
  • Extending verifiers with hard+soft checks and asymmetric oracle policies
  • Reporting multidimensional metrics capturing accuracy, cost, and temporal factors
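
As an illustration of the last two recommendations, the sketch below records accuracy alongside cost and latency and filters systems to an accuracy/cost/latency frontier; the fields and the dominance rule are assumptions, not a standardized reporting protocol.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RunReport:
    """Illustrative multidimensional report: accuracy alone hides cost/latency trade-offs."""
    system: str
    accuracy: float   # e.g. Pass@1 on the evaluated split
    cost_usd: float   # total API + tool spend
    latency_s: float  # mean wall-clock time per task

def pareto_frontier(reports: List[RunReport]) -> List[RunReport]:
    # Keep systems not dominated on (higher accuracy, lower cost, lower latency).
    def dominates(b: RunReport, a: RunReport) -> bool:
        no_worse = (b.accuracy >= a.accuracy and b.cost_usd <= a.cost_usd
                    and b.latency_s <= a.latency_s)
        strictly_better = (b.accuracy > a.accuracy or b.cost_usd < a.cost_usd
                           or b.latency_s < a.latency_s)
        return no_worse and strictly_better
    return [r for r in reports if not any(dominates(o, r) for o in reports)]

reports = [RunReport("A", 0.70, 12.0, 90.0), RunReport("B", 0.65, 4.0, 40.0),
           RunReport("C", 0.60, 15.0, 120.0)]
print([r.system for r in pareto_frontier(reports)])  # ['A', 'B']: C is dominated by A
```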

A plausible implication is that progress towards robust AGI, as measured by the GAIA gap (the human–agent performance delta), will depend on continual integration of modular tool suites, dynamic agent orchestration, cross-lingual and cross-modal adaptation, and real-time scenario extension within simulation ecosystems (Mialon et al., 2023, Hofman et al., 21 May 2025, Andrews et al., 21 Sep 2025).


References:

  • Mialon et al., 2023
  • Hofman et al., 21 May 2025
  • Jia et al., 9 Oct 2025
  • Liu et al., 1 Oct 2025
  • Żywot et al., 16 Jan 2026
  • Xie et al., 13 Aug 2025
  • Andrews et al., 21 Sep 2025
