
ErrorAtlas: Taxonomy of LLM Failure Modes

Updated 29 January 2026
  • ErrorAtlas is a hierarchical taxonomy that defines 17 top-level error categories to diagnose and analyze LLM failure modes.
  • It employs the ErrorMap pipeline to extract, label, and cluster error instances, providing granular insights into model mistakes.
  • The framework aids in model debugging, benchmark validation, and research by revealing detailed failure signatures beyond aggregate metrics.

ErrorAtlas is a static, model-agnostic taxonomy of failure modes for LLMs, designed to systematically organize and clarify model mistakes across tasks and domains. It provides a fixed, hierarchical structure of 17 top-level error categories, each with detailed descriptions and supporting examples. ErrorAtlas empowers developers and researchers to move beyond aggregate performance metrics and gain targeted insight into the underlying causes of LLM failures, thus supporting model debugging, benchmark validation, and targeted research interventions. By operationalizing "failure signatures," ErrorAtlas reveals not just where, but why, models go wrong—a level of granularity not attainable with standard benchmarks alone (Ashury-Tahan et al., 22 Jan 2026).

1. Motivation and Theoretical Foundation

Traditional LLM evaluation benchmarks, such as those assessing accuracy on reasoning or question answering, signal when a model fails but do not resolve why. A poor score might stem from a variety of sources: formatting mistakes, misinterpreted instructions, computational errors, or even test-set anomalies. ErrorAtlas addresses this limitation by providing a standardized classification of failure types, systematically disentangling root causes and surfacing underexplored error modes (e.g., omissions or misinterpretation of requirements). It thereby offers a means to extract and analyze model "failure signatures": the distinctive ensemble patterns of mistakes that individual LLMs make (Ashury-Tahan et al., 22 Jan 2026).

2. The ErrorMap Pipeline: Extracting and Structuring Failure Data

ErrorAtlas is derived empirically using ErrorMap, a dynamic pipeline for extracting, labeling, and hierarchically categorizing model errors. ErrorMap operates in two principal stages:

Stage 1: Per-Instance Error Analysis

  • Incorrect Instance Identification: All incorrect predictions are selected using the benchmark’s primary metric. The failure threshold τ defaults to 0.7 × max score for non-binary tasks.
  • LLM-Driven Annotation: Each failed example is adjudicated by prompting a judge LLM with the instance input, ground-truth references, a few "Informative Correct Predictions" (ICPs) from other models, and the model’s erroneous output.
  • Criteria Extraction and Error Labeling: For each case, the judge lists necessary reasoning steps or answer components, assesses presence/quality for each in the output, and generates a short, descriptive free-form error label (e.g., “Misinterpretation of Specification”).
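The failure-selection step above can be sketched in a few lines. This is a minimal illustration, not code from the ErrorMap release; the function names and the binary-task convention (any non-perfect score counts as a failure) are assumptions.

```python
# Illustrative sketch of Stage 1 failure selection. Function names are
# hypothetical; only the 0.7 x max-score threshold comes from the text above.

def failure_threshold(max_score: float, binary: bool = False) -> float:
    """Return tau: any instance scoring below tau counts as a failure.
    For binary tasks we assume any non-perfect score is a failure."""
    return max_score if binary else 0.7 * max_score

def select_failures(scores, max_score=1.0, binary=False):
    """Return indices of instances whose score falls below tau."""
    tau = failure_threshold(max_score, binary)
    return [i for i, s in enumerate(scores) if s < tau]

# Example: on a 0-1 graded task, tau = 0.7, so 0.65 fails but 0.9 passes.
print(select_failures([0.9, 0.65, 0.2, 1.0]))  # -> [1, 2]
```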

Stage 2: Taxonomy Hierarchy Construction

  • Category Generation: Free-form error labels are clustered via iterative data mining (LLM-in-the-loop clustering, following Wan et al. 2024) into candidate categories; clusters are refined to ensure semantic coherence and non-overlap.
  • Error Assignment and Hierarchy: Finalized categories are batch-assigned to error instances, which are then organized into a multi-layer taxonomy (top-level, subcategories, sub-subcategories).

The pipeline is codified as:

function ErrorMap(model_outputs, references, inputs):
    failures = select_failures(model_outputs, threshold=τ)
    analyses = []
    for batch in chunk(failures, size=500):
        analyses += judge_per_instance(batch, references, inputs)
    error_labels = [a.label for a in analyses]
    categories = generate_taxonomy(error_labels)  # Stage 2.a
    assignments = classify_labels(analyses, categories)  # Stage 2.b
    return build_hierarchy(assignments)

Key formulas employed:

  • Failure threshold: an instance is counted as a failure if score_i < τ.
  • Cosine similarity: Used for error label embedding checks,

\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}

  • Statistical testing: Significance of error rate differences computed with a binomial test,

p = \sum_{k \ge x} \binom{n}{k} p_0^{k} (1 - p_0)^{n - k}

where p_0 is the stronger model’s error rate and x is the number of errors observed in the weaker model.
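Both formulas are standard and can be implemented directly; the sketch below uses only the standard library and is not code from the paper.

```python
# Minimal implementations of the two formulas above (standard definitions).
import math

def cosine_sim(u, v):
    """sim(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def binomial_tail(x, n, p0):
    """One-sided p-value: P(X >= x) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
               for k in range(x, n + 1))

print(round(cosine_sim([1, 0], [1, 1]), 4))  # -> 0.7071
print(round(binomial_tail(8, 10, 0.5), 4))   # -> 0.0547
```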

3. Taxonomy Structure: ErrorAtlas Categories

ErrorAtlas’s 17 top-level error categories—ranked by empirical prevalence—enable granular mapping of LLM failure landscapes. Each has subcategories and canonical examples. The five most prevalent types, with definitions and illustrative instances, are as follows:

| Category | Definition | Example (from (Ashury-Tahan et al., 22 Jan 2026)) |
| --- | --- | --- |
| Missing Required Element | Omitting mandatory sections, fields, or content | Failing to list all inactive ingredients in a medication QA |
| Specification Misinterpretation | Misunderstanding requirements, formats, or constraints | Swapping row/column roles on a table reasoning task |
| Logical Reasoning Error | Faulty inference or application of logic rules | Inferring "all A are C" from "some A are B" and "all B are C" |
| Incorrect Identification | Mislabeling objects, concepts, or entities | Entity or role misidentification |
| Computation Error | Incorrect numeric or algebraic calculations | Using the wrong combinatorial formula in a math problem |

Additional high-level categories (with subcategories present) include Output Formatting Error, Irrelevant/Extraneous Content, Counting/Enumeration Error, Answer Selection Error, Incomplete Reasoning, Factual Error, Tool/API Usage Error, Naming/Symbol Error, Inappropriate Refusal, Unit Conversion Error, False Positive Detection, and Error Detection Failure. Each category is defined precisely and is supported by structured per-instance evidence and free-text summaries.

A full three-layer taxonomy, with sub- and sub-subcategories, is provided in the appendix of (Ashury-Tahan et al., 22 Jan 2026).
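A three-layer taxonomy of this kind is naturally represented as a nested mapping. The sketch below is purely illustrative: the top-level category names come from the table above, but the subcategory and leaf names are hypothetical placeholders, not entries from the actual appendix.

```python
# Illustrative in-memory shape for a three-layer taxonomy. Subcategory and
# leaf names below are hypothetical placeholders, not from the paper.
TAXONOMY = {
    "Missing Required Element": {
        "Omitted Content": ["Missing Field", "Missing Section"],
    },
    "Logical Reasoning Error": {
        "Invalid Inference": ["Quantifier Misuse"],
    },
}

def leaf_path(taxonomy, leaf):
    """Return (top, sub, leaf) for a sub-subcategory, or None if absent."""
    for top, subs in taxonomy.items():
        for sub, leaves in subs.items():
            if leaf in leaves:
                return (top, sub, leaf)
    return None

print(leaf_path(TAXONOMY, "Missing Field"))
# -> ('Missing Required Element', 'Omitted Content', 'Missing Field')
```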

4. Empirical Construction and Statistical Insights

ErrorAtlas was built by applying ErrorMap to predictions from 83 diverse LLMs across 35 datasets spanning general reasoning (HELM Capabilities, Capabilities leaderboard), medical (MedHELM), table-based reasoning (ToRR), function-calling (BFCL-v4), and code generation (HumanEval, MBPP). Sampling ~10% of failed model-instance pairs (approx. 7,000 examples) per model × dataset combination enabled robust category prevalence estimation.
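The per-cell sampling described above can be sketched as a stratified draw over model × dataset cells. This is an assumed reconstruction, not the paper's code; the function name and the keep-at-least-one rule are mine.

```python
# Sketch of ~10% stratified sampling of failures, drawn independently
# within each (model, dataset) cell. Names and the minimum-one rule are
# hypothetical, not from the ErrorMap implementation.
import random

def sample_failures(failures_by_cell, rate=0.10, seed=0):
    """failures_by_cell: {(model, dataset): [instance_id, ...]}."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = {}
    for cell, ids in failures_by_cell.items():
        k = max(1, round(rate * len(ids)))  # keep at least one per cell
        sampled[cell] = rng.sample(ids, k)
    return sampled

cells = {("model-a", "MMLU-Pro"): list(range(40))}
print({c: len(v) for c, v in sample_failures(cells).items()})  # 4 of 40 kept
```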

Findings include:

  • Over 40% of errors in high-profile reasoning benchmarks (MMLU-Pro, Omni-MATH, GPQA) stem from technical issues (format violations, omissions) rather than deficient reasoning per se.
  • The most prevalent error modes are Missing Required Element (15.56%), Specification Misinterpretation (11.50%), Logical Reasoning Error (9.09%), Incorrect Identification (8.98%), and Computation Error (8.45%).
  • Noteworthy model-specific error profiles: Gemini 2.0 Flash Lite exhibits frequent omissions, Claude 3.5 Haiku displays increased logical reasoning issues, and Mixtral 8x22b errs more on computation.
  • ErrorAtlas identifies underexplored failure classes—particularly Missing Required Element and Specification Misinterpretation—as both frequent and impactful.

5. Applications and Use Cases

ErrorAtlas supports several research and engineering workflows:

  • Model Debugging & Version Diffing: Enables comparison of error-type distributions between model variants (e.g., Gemini 1.5 Flash vs. Pro) to detect regressions or confirm targeted improvements.
  • Benchmark Validation and Curation: Assesses whether new or existing benchmarks test the intended skills. A case study on MMLU-Pro demonstrated ErrorMap’s ability to reproduce the benchmark’s own manual error annotation, validating its domain coverage.
  • Model Selection and Deployment: Facilitates application-specific model choice (e.g., selecting a model with lower hallucination rates for healthcare scenarios).
  • Shaping Research Directions: Identifies neglected but common error types (e.g., omissions, misreading prompts) as promising targets for prompt engineering, fine-tuning strategies, or architectural adaptation.
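The version-diffing workflow above amounts to comparing error-category frequency distributions between two model variants. A minimal sketch (the counts below are made up for illustration):

```python
# Compare error-type distributions between two model versions, as in the
# debugging use case above. Illustrative only; counts are invented.
from collections import Counter

def error_distribution(labels):
    """Normalize a list of error-category labels to frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def diff_distributions(old, new):
    """Per-category frequency delta (new - old); positive = regression."""
    cats = set(old) | set(new)
    return {c: new.get(c, 0.0) - old.get(c, 0.0) for c in cats}

v1 = error_distribution(["Computation Error"] * 3 + ["Factual Error"])
v2 = error_distribution(["Computation Error"] + ["Factual Error"] * 3)
print(diff_distributions(v1, v2))
# Computation Error falls by 0.5 while Factual Error rises by 0.5,
# flagging a shift in the model's failure profile between versions.
```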

6. Limitations and Prospects

Several methodological constraints and future challenges are acknowledged:

  • ErrorAtlas’s applicability is currently limited to tasks with generative, explainable outputs; tasks requiring only classification without chain-of-thought are less amenable.
  • As ErrorMap relies on an LLM-based "judge," annotation quality is bounded by the accuracy of the underlying LLM (validated at ~92%); the LLM-in-the-loop design also introduces some circularity.
  • Coverage may be incomplete for niche domains outside the 35 analyzed datasets.
  • Forthcoming enhancements include periodic taxonomy updates with new benchmarks/models, development of semi-automatic remediation tools, and integration of white-box (internal model state) signals to complement output-based analysis.

7. Summary and Impact

ErrorAtlas, constructed through the systematic ErrorMap pipeline, establishes a reproducible, hierarchical framework for classifying and analyzing LLM failure modes. It advances transparency in LLM evaluation by providing actionable diagnostic layers that extend beyond success/failure task-level metrics, and enables researchers and practitioners to monitor, compare, and iteratively refine LLMs with increased accountability and precision (Ashury-Tahan et al., 22 Jan 2026).
