ErrorMap Pipeline: LLM Error Diagnosis

Updated 30 April 2026

The ErrorMap pipeline is a systematic framework for diagnosing, categorizing, and quantifying failure modes in LLM outputs through two stages: per-instance error analysis and taxonomy induction.
It uses LLM-based prompts to generate structured JSON diagnostics, from which it derives per-model error signatures that support cross-model comparison.
The framework supports both newly induced taxonomies and fixed ones such as ErrorAtlas, which captures up to 95% of observed errors, so it can be deployed quickly across diverse datasets and models.

The ErrorMap pipeline is a structured framework for systematic diagnosis, categorization, and quantification of failure modes in LLM outputs. By translating raw, incorrect predictions into multi-tiered taxonomies and interpretable failure signatures, ErrorMap enables precise model debugging and cross-model comparability. The system underpins the construction of ErrorAtlas, a static, comprehensive taxonomy that exposes prevailing and underexplored LLM error types. ErrorMap operates purely on model outputs and ground-truth references, with no reliance on task-specific engineering or model internals, making it broadly applicable across LLMs and benchmarks (Ashury-Tahan et al., 22 Jan 2026).

1. Two-Stage Error Diagnosis and Taxonomy Induction

ErrorMap employs a two-stage procedure to extract structured diagnostics from model outputs and construct a robust, hierarchical taxonomy:

Stage 1: Per-Instance Error Analysis
- Inputs: For each failed model–dataset instance $(i, m)$, the system takes the input text $x_i$, gold answer $y_i$ (if available), up to $K$ informative correct predictions (ICPs) from other models, and one erroneous prediction $\hat{y}_i^m$ from the target model.
- Process: The input is formatted into a prompt for a designated grading LLM (e.g., gpt-oss-120b). The LLM produces a structured JSON analysis including:
  - required_criteria: a list of necessary reasoning steps or sub-tasks, with evidence-based checks for satisfaction.
  - error_title: a concise label identifying the first major error.
  - error_summary: a brief diagnosis of root cause.

Stage 2: Taxonomy Hierarchy Building
- 2.a Category Generation: All free-form error_title strings from Stage 1 are clustered into broader categories using iterative LLM-in-the-loop clustering (generate, refine, review).
- 2.b Error Assignment: Each per-instance error label is mapped to a taxonomy category, creating a populated hierarchy spanning categories, sub-categories, and instance mappings.

This two-stage pipeline yields a per-run, multi-level error taxonomy that can either be held static for consistent future application or evolved as new data and models arrive.
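The Stage 1 output can be pictured as a small typed record. The following is a minimal sketch, assuming the judge returns one JSON object whose keys match the field names listed in this article; the Python types, the `parse_diagnostic` helper, and the sample values (including the `"partial"` quality rating and the evidence strings) are illustrative assumptions, not the authors' schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    criterion: str  # required reasoning step or sub-task
    present: bool   # assumed boolean: does the prediction address it?
    quality: str    # judge's rating; the value set is an assumption
    evidence: str   # support quoted from the prediction
    comment: str


@dataclass
class Diagnostic:
    required_criteria: List[Criterion]
    error_title: str
    error_summary: str


def parse_diagnostic(raw: str) -> Diagnostic:
    """Parse and minimally validate one judge response (hypothetical helper)."""
    data = json.loads(raw)
    criteria = [Criterion(**c) for c in data["required_criteria"]]
    title = data["error_title"].strip()
    if not title:
        raise ValueError("error_title must be non-empty")
    return Diagnostic(criteria, title, data["error_summary"].strip())


# Illustrative response; every value below is made up for the example.
raw = json.dumps({
    "required_criteria": [
        {"criterion": "examine all feasible a", "present": True,
         "quality": "partial", "evidence": "only one case is tried",
         "comment": "other cases are never checked"},
        {"criterion": "list all permutations", "present": False,
         "quality": "partial", "evidence": "", "comment": "single tuple given"},
    ],
    "error_title": "Incomplete solution set",
    "error_summary": "The answer lists one solution without a completeness check.",
})
diag = parse_diagnostic(raw)
print(diag.error_title, len(diag.required_criteria))
```

Validating the response before Stage 2 keeps malformed judge outputs from silently entering the clustering step.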

2. Formal Metrics and Failure Signature Representation

The ErrorMap formalism is grounded in clear mathematical definitions and metrics.

Failure Instance Detection: For dataset $D=\{x_i,y_i\}_{i=1}^N$ and model $m$ , the task-specific score $\mathrm{score}(m,x_i)$ is compared to threshold $\tau$ . The failure indicator is defined:

$E_{m,i} = \begin{cases} 1 & \text{if } \mathrm{score}(m,x_i)<\tau \\ 0 & \text{otherwise} \end{cases}$
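As a minimal sketch of the indicator above, assuming per-instance scores are already available as plain floats (the `failure_indicators` name, the model names, and the sample scores are hypothetical):

```python
from typing import Dict, List


def failure_indicators(scores: Dict[str, List[float]], tau: float) -> Dict[str, List[int]]:
    """Return 1 where score < tau (a failure) and 0 otherwise, per model."""
    return {
        model: [1 if s < tau else 0 for s in per_instance]
        for model, per_instance in scores.items()
    }


# Illustrative scores for two models on three instances.
scores = {"model_a": [1.0, 0.0, 0.4], "model_b": [0.5, 0.9, 0.2]}
E = failure_indicators(scores, tau=0.5)
print(E)
```

Because the comparison is strict, a score exactly equal to the threshold (0.5 for `model_b` on the first instance) does not count as a failure.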

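Taxonomy Construction proceeds via an assignment $a$ mapping each free-form error label to one of $C$ categories $c_1, \dots, c_C$.
Failure signature vector (for model $m$):

Let $n_{m,k}$ denote the number of failures of model $m$ assigned to category $c_k$. The total number of failures is $N_m = \sum_{k=1}^{C} n_{m,k}$. The normalized failure signature $\mathbf{s}_m$ is given by:

$\mathbf{s}_m = \left( \frac{n_{m,1}}{N_m}, \dots, \frac{n_{m,C}}{N_m} \right)$

which is a model-specific vector quantifying the relative prevalence of failure categories.

Aggregate and Comparative Metrics:
- Overall error rate $\frac{1}{N}\sum_{i=1}^{N} E_{m,i}$.
- Model-to-model failure signature distance $d(\mathbf{s}_m, \mathbf{s}_{m'})$ between the signature vectors of two models $m$ and $m'$.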
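The normalized failure signature (per-category failure counts divided by the model's total failure count) and a signature distance can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the L1 and cosine distances are stand-ins chosen for the example rather than the measures used by the authors.

```python
from collections import Counter
from math import sqrt
from typing import List, Sequence


def failure_signature(assigned: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Fraction of a model's failures in each category, in `categories` order.

    Labels that are not in `categories` are ignored (an assumption of this sketch).
    """
    counts = Counter(assigned)
    total = sum(counts[c] for c in categories)
    if total == 0:
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p: Sequence[float], q: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0


categories = ["Logical Reasoning Error", "Missing Required Element", "Computation Error"]
# Illustrative category assignments for the failures of two models.
sig_a = failure_signature(["Missing Required Element"] * 3 + ["Computation Error"], categories)
sig_b = failure_signature(["Logical Reasoning Error"] * 2 + ["Missing Required Element"] * 2, categories)
print(sig_a, sig_b, l1_distance(sig_a, sig_b), cosine_distance(sig_a, sig_b))
```

Because each signature sums to 1, two models with very different overall error rates can still be compared on the shape of their failures.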

3. Core Algorithms in ErrorMap

The principal algorithms are implemented as LLM-centric pipelines with minimal task-specific intervention.

Per-Instance Error Analysis (Algorithm 1): Prompts the LLM with instance data to output a diagnostic JSON containing the required criteria (each with the fields criterion, present, quality, evidence, comment), error_title, and error_summary.
Taxonomy Generation (Algorithm 2): Runs a clustering LLM prompt to generate categories, refines them over batches, and finalizes the taxonomy with a review pass.
Error Assignment (Algorithm 3): Classifies error titles against the taxonomy in batches and populates the category→instance mappings.
Failure Signature Computation (Algorithm 4): Computes the normalized failure signature vector for each model from the error-label-to-category mappings.

All algorithms leverage LLM-based reasoning for both the extraction of fine-grained, context-dependent error types and the abstraction needed for robust clustering.
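The generate, refine, review flow of taxonomy generation can be outlined as a control-flow skeleton. This is a sketch under assumptions: the three callables stand in for LLM prompts, seeding the taxonomy from the first batch is a guess at the batching scheme, and the prefix-based toy functions exist only to make the example runnable.

```python
from typing import Callable, Iterator, List, Sequence

Taxonomy = List[str]


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def induce_taxonomy(
    titles: Sequence[str],
    generate: Callable[[List[str]], Taxonomy],
    refine: Callable[[Taxonomy, List[str]], Taxonomy],
    review: Callable[[Taxonomy], Taxonomy],
    batch_size: int = 50,
) -> Taxonomy:
    batches = list(batched(titles, batch_size))
    if not batches:
        return []
    taxonomy = generate(batches[0])         # initial category proposal
    for batch in batches[1:]:
        taxonomy = refine(taxonomy, batch)  # update with unseen error titles
    return review(taxonomy)                 # final consolidation pass


# Toy stand-ins for the LLM prompts: the category is the text before ":".
def prefix(title: str) -> str:
    return title.split(":")[0]


titles = ["math: wrong sum", "logic: circular argument",
          "math: sign slip", "format: missing JSON key"]
taxonomy = induce_taxonomy(
    titles,
    generate=lambda batch: sorted({prefix(t) for t in batch}),
    refine=lambda tax, batch: sorted(set(tax) | {prefix(t) for t in batch}),
    review=lambda tax: sorted(set(tax)),
    batch_size=2,
)
print(taxonomy)
```

Passing the prompts in as callables keeps the loop testable without a model and lets the same skeleton run with any judge backend.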

4. ErrorAtlas: Static Failure Taxonomy and Prevalence

Applying ErrorMap across 83 models and 35 datasets, with on the order of 7,000 sampled failure instances, led to the creation of ErrorAtlas, a fixed taxonomy that covers up to 95% of observed LLM failures.

| Category | Description |
| --- | --- |
| Logical Reasoning Error | Fails in inference or applying correct reasoning steps. |
| Missing Required Element | Omits mandatory sections, fields, or specified content. |
| Computation Error | Incorrect numerical, algebraic, or geometric results. |
| Incorrect Identification | Mislabels or misidentifies objects, concepts, or entities. |
| Specification Misinterpret. | Misunderstands task requirements or misformats outputs. |
| Output Formatting Error | Violates required structure, punctuation, case, or markup rules. |
| Irrelevant/Extraneous Content | Generates off-topic or unnecessary information. |
| Counting/Enumeration Error | Over-/under-counts, omits cases in combinatorial steps. |
| Answer Selection Error | Maps a solution to the wrong multiple-choice option. |
| Incomplete Reasoning | Omits essential explanation, proof steps, or justification. |
| Factual Error | Supplies inaccurate or fabricated domain knowledge. |
| Tool/API Usage Error | Misuses or omits required tool or API calls. |
| Naming/Symbol Error | Uses incorrect symbols, variable names, or identifiers. |
| Inappropriate Refusal | Unjustifiably refuses to answer. |
| Unit Conversion Error | Incorrect unit or percentage conversions. |
| False Positive Detection | Flags errors or anomalies that do not exist. |
| Error Detection Failure | Fails to recognize existing mistakes (missed error). |

Prevalence statistics (e.g., "Missing Required Element" appears in 31/35 datasets and 82/83 models, prevalence 15.6%; "Logical Reasoning Error" is in 25 datasets, 56 models, prevalence 9.1%) demonstrate wide coverage and specificity (Ashury-Tahan et al., 22 Jan 2026).
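Statistics of this kind can be derived from a flat list of assigned failures. The sketch below assumes each record is a (dataset, model, category) triple and takes prevalence to mean a category's share of all assigned failure instances; that reading, the `category_coverage` name, and the sample records are assumptions for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Coverage(NamedTuple):
    datasets: int      # number of datasets in which the category appears
    models: int        # number of models that exhibit the category
    prevalence: float  # share of all failure records in this category


def category_coverage(records: Iterable[Tuple[str, str, str]]) -> Dict[str, Coverage]:
    rows = list(records)
    datasets: Dict[str, Set[str]] = {}
    models: Dict[str, Set[str]] = {}
    counts: Dict[str, int] = {}
    for dataset, model, category in rows:
        datasets.setdefault(category, set()).add(dataset)
        models.setdefault(category, set()).add(model)
        counts[category] = counts.get(category, 0) + 1
    total = len(rows)
    return {
        c: Coverage(len(datasets[c]), len(models[c]), counts[c] / total)
        for c in counts
    }


# Illustrative records, not data from the paper.
records = [
    ("d1", "m1", "Missing Required Element"),
    ("d2", "m1", "Missing Required Element"),
    ("d1", "m2", "Computation Error"),
    ("d2", "m2", "Missing Required Element"),
]
stats = category_coverage(records)
print(stats["Missing Required Element"])
```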

5. Representative Examples and Model Comparisons

Concrete applications elucidate the pipeline's interpretability and practical value:

Instance Diagnosis: For the OmniMath dataset and Gemini 2.0 Flash Lite, when asked to solve an equation in positive integers, the model listed only (1,10,10) with no completeness check. ErrorMap labeled this "Incomplete solution set" and pinpointed the partially met criteria ("examine all feasible $a$", "list all permutations").
Failure Signature Comparisons: Figure 1 contrasts models (e.g., Claude 3.5 Haiku vs. Gemini 2.0 Flash Lite) by per-category failure fraction, showing a high share of "Missing Required Element" for Gemini and an elevated share of "Logical Reasoning Error" for Claude, which illustrates model-specific weaknesses.

6. Generalizability and Deployment Practices

ErrorMap is agnostic to the underlying model architecture and dataset specifics. Its sole requirements are input–output pairs (optionally with ground truth and correct references), a judge LLM, and standard prompt templates. The pipeline does not demand token-level alignments or model white-box access.

Immediate Applicability: Any LLM or dataset, including emergent benchmarks, can be analyzed by running Stage 1 and Stage 2.b against ErrorAtlas to produce failure signatures.
Practical Deployment: For expediency, users often bypass Stage 2.a (new taxonomy induction), relying on ErrorAtlas for robust, static categorization. Running Stage 1 for error titles and Stage 2.b for assignment enables rapid, cost-efficient diagnosis without loss of diagnostic integrity.
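The shortened deployment path (Stage 1 followed by Stage 2.b against a fixed taxonomy) can be sketched as below. The `judge` and `assign` callables stand in for LLM prompts, the `ERROR_ATLAS` tuple is only an excerpt of the category list, and the "Unassigned" fallback bucket is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

# Excerpt of the ErrorAtlas categories, for illustration only.
ERROR_ATLAS = ("Logical Reasoning Error", "Missing Required Element", "Computation Error")


def diagnose_with_fixed_taxonomy(
    failures: Sequence[dict],
    judge: Callable[[dict], str],
    assign: Callable[[str, Sequence[str]], str],
    taxonomy: Sequence[str] = ERROR_ATLAS,
) -> Dict[str, float]:
    """Return the share of failures per category; Stage 2.a is skipped."""
    counts: Counter = Counter()
    for instance in failures:
        title = judge(instance)             # Stage 1: free-form error title
        category = assign(title, taxonomy)  # Stage 2.b: map title to a category
        if category not in taxonomy:
            category = "Unassigned"         # assumed fallback bucket
        counts[category] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


# Toy stand-ins so the sketch runs without an LLM.
failures = [{"prediction": "2+2=5"}, {"prediction": "answer omitted"}]
signature = diagnose_with_fixed_taxonomy(
    failures,
    judge=lambda inst: "arithmetic slip" if "=" in inst["prediction"] else "missing final answer",
    assign=lambda title, tax: "Computation Error" if "arithmetic" in title else "Missing Required Element",
)
print(signature)
```

Holding the taxonomy fixed means signatures computed in different runs share the same categories and can be compared directly.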

This approach yields both detailed, per-instance explanations and condensed, model-level diagnostics, advancing interpretability and comparability in LLM evaluation frameworks (Ashury-Tahan et al., 22 Jan 2026).


References (1)

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models (2026)
