
ErrorMap Pipeline: LLM Error Diagnosis

Updated 30 April 2026
  • ErrorMap pipeline is a systematic framework for diagnosing, categorizing, and quantifying failure modes in LLM outputs using a two-stage error analysis and taxonomy induction.
  • It leverages LLM-based prompts to generate structured JSON diagnostics, enabling precise mapping of error signatures and robust cross-model comparisons.
  • The framework supports error taxonomies such as ErrorAtlas, a static taxonomy that captures up to 95% of observed errors, facilitating rapid deployment across diverse datasets and models.

ErrorMap Pipeline

The ErrorMap pipeline is a structured framework for systematic diagnosis, categorization, and quantification of failure modes in LLM outputs. By translating raw, incorrect predictions into multi-tiered taxonomies and interpretable failure signatures, ErrorMap enables precise model debugging and cross-model comparability. The system underpins the construction of ErrorAtlas, a static, comprehensive taxonomy that exposes prevailing and underexplored LLM error types. ErrorMap operates purely on model outputs and ground-truth references, with no reliance on task-specific engineering or model internals, making it broadly applicable across LLMs and benchmarks (Ashury-Tahan et al., 22 Jan 2026).

1. Two-Stage Error Diagnosis and Taxonomy Induction

ErrorMap employs a two-stage procedure to extract structured diagnostics from model outputs and construct a robust, hierarchical taxonomy:

  • Stage 1: Per-Instance Error Analysis
    • Inputs: For each failed model–dataset instance $(i, m)$, the system takes the input text $x_i$, gold answer $y_i$ (if available), up to $K$ informative correct predictions (ICPs) from other models, and one erroneous prediction $\hat{y}_i^m$ from the target model.
    • Process: The input is formatted into a prompt for a designated grading LLM (e.g., gpt-oss-120b). The LLM produces a structured JSON analysis including:
      • required_criteria: a list of necessary reasoning steps or sub-tasks, with evidence-based checks for satisfaction.
      • error_title: a concise label identifying the first major error.
      • error_summary: a brief diagnosis of root cause.
  • Stage 2: Taxonomy Hierarchy Building
    • 2.a Category Generation: All free-form error_title strings from Stage 1 are clustered into broader categories using iterative LLM-in-the-loop clustering (generate, refine, review).
    • 2.b Error Assignment: Each per-instance error label is mapped to a taxonomy category, creating a populated hierarchy spanning categories, sub-categories, and instance mappings.

This two-stage pipeline yields a per-run, multi-level error taxonomy that can be held static for consistent future application or evolved with new data and models.
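The Stage 1 output can be pictured as a small typed record. The sketch below is illustrative only: the field names come from the description in this article, while the Python types, the `parse_diagnostic` and `stage2_inputs` helpers, and the sample JSON reply are assumptions, not part of the published pipeline.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    criterion: str  # required reasoning step or sub-task
    present: bool   # assumed boolean: does the prediction address it?
    quality: str    # free-form rating; the value set is an assumption
    evidence: str   # supporting excerpt from the prediction
    comment: str


@dataclass
class Diagnostic:
    required_criteria: List[Criterion]
    error_title: str
    error_summary: str


def parse_diagnostic(raw: str) -> Diagnostic:
    """Parse the judge's JSON reply; raises if a field is missing or empty."""
    data = json.loads(raw)
    criteria = [Criterion(**c) for c in data["required_criteria"]]
    title = data["error_title"].strip()
    if not title:
        raise ValueError("error_title must be non-empty")
    return Diagnostic(criteria, title, data["error_summary"].strip())


def stage2_inputs(diagnostics: List[Diagnostic]) -> List[str]:
    """Collect the free-form titles that Stage 2.a would cluster."""
    return [d.error_title for d in diagnostics]


# Hypothetical judge reply, hand-written for illustration.
raw = """{
  "required_criteria": [
    {"criterion": "list all permutations", "present": false,
     "quality": "partial", "evidence": "only (1,10,10) given", "comment": ""}
  ],
  "error_title": "Incomplete solution set",
  "error_summary": "No completeness check was performed."
}"""
diag = parse_diagnostic(raw)
titles = stage2_inputs([diag])
```

Validating the reply at parse time keeps malformed judge output from silently entering the clustering stage.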

2. Formal Metrics and Failure Signature Representation

The ErrorMap formalism is grounded in clear mathematical definitions and metrics.

  • Failure Instance Detection: For dataset $D=\{x_i,y_i\}_{i=1}^N$ and model $m$, the task-specific score $\mathrm{score}(m,x_i)$ is compared to a threshold $\tau$. The failure indicator is defined as:

$$E_{m,i} = \begin{cases} 1 & \text{if } \mathrm{score}(m,x_i)<\tau \\ 0 & \text{otherwise} \end{cases}$$

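  • Taxonomy Construction proceeds via an assignment $a$ mapping free-form error labels to one of $C$ categories $\{c_1,\dots,c_C\}$.
  • Failure signature vector (for model $m$):

Let $n_{m,k}$ be the number of failures of model $m$ assigned to category $c_k$. The total number of failures is $N_m=\sum_{k=1}^{C} n_{m,k}$. The normalized failure signature $s_m$ is given by:

$$s_m = \left(\frac{n_{m,1}}{N_m},\dots,\frac{n_{m,C}}{N_m}\right)$$

which is a model-specific vector quantifying the relative prevalence of failure categories.

  • Aggregate and Comparative Metrics:
    • Overall error rate $\mathrm{ER}_m = \frac{1}{N}\sum_{i=1}^{N} E_{m,i}$.
    • Model-to-model failure signature distance, e.g. $\lVert s_m - s_{m'}\rVert_1$ or $\lVert s_m - s_{m'}\rVert_2$.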
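The metrics in this section reduce to a few lines of arithmetic. The following is a minimal sketch rather than the authors' implementation: the function names are hypothetical, every assigned label is assumed to belong to the category list, and the L1 norm is used as one plausible choice of signature distance.

```python
from collections import Counter
from typing import List, Sequence


def failure_indicators(scores: Sequence[float], tau: float) -> List[int]:
    """1 where the task score falls below the threshold tau, else 0."""
    return [1 if s < tau else 0 for s in scores]


def error_rate(indicators: Sequence[int]) -> float:
    """Fraction of instances flagged as failures."""
    return sum(indicators) / len(indicators) if indicators else 0.0


def failure_signature(assigned: Sequence[str],
                      categories: Sequence[str]) -> List[float]:
    """Per-category share of a model's failures, in category order."""
    counts = Counter(assigned)
    total = sum(counts[c] for c in categories)
    if total == 0:
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


def l1_distance(sig_a: Sequence[float], sig_b: Sequence[float]) -> float:
    """Assumed distance: sum of absolute per-category differences."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))


# Toy data, invented for illustration.
categories = ["Computation Error", "Logical Reasoning Error", "Factual Error"]
indicators = failure_indicators([0.9, 0.2, 0.4, 1.0, 0.1], tau=0.5)
rate = error_rate(indicators)
sig_a = failure_signature(
    ["Computation Error", "Logical Reasoning Error", "Computation Error"],
    categories,
)
sig_b = failure_signature(categories, categories)  # one failure per category
distance = l1_distance(sig_a, sig_b)
```

Because signatures are normalized, two models with very different error rates can still be compared on the shape of their failures alone.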

3. Core Algorithms in ErrorMap

The principal algorithms are implemented in LLM-centric pipelines with minimal task-specific intervention.

  • Per-Instance Error Analysis (Algorithm 1): Prompts the LLM with instance data to output a diagnostic JSON—including required criteria (each with fields criterion, present, quality, evidence, comment), error_title, and error_summary.
  • Taxonomy Generation (Algorithm 2): Runs a clustering LLM prompt for category generation, refines via batches, and finalizes taxonomy with review.
  • Error Assignment (Algorithm 3): Batch classifies each error title against the taxonomy, populates category→instance mappings.
  • Failure Signature Computation (Algorithm 4): Computes the normalized failure signature vector for each model from the error-label/category mappings.

All algorithms leverage LLM-based reasoning for both the extraction of fine-grained, context-dependent error types and the abstraction needed for robust clustering.
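As a rough illustration of the batch assignment step (Algorithm 3), the sketch below walks error titles in batches and fills a category-to-instance mapping. The `classify_batch` callable stands in for the LLM call and is hypothetical, as are the `keyword_stub` classifier, the batch size, and the "Unassigned" fallback bucket; none of these names come from the source.

```python
from typing import Callable, Dict, Iterator, List, Sequence

# Hypothetical interface: (titles, taxonomy) -> one label per title.
Classifier = Callable[[Sequence[str], Sequence[str]], List[str]]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def assign_errors(titles: Sequence[str], taxonomy: Sequence[str],
                  classify_batch: Classifier, batch_size: int = 20,
                  fallback: str = "Unassigned") -> Dict[str, List[int]]:
    """Map each category to the indices of the titles assigned to it."""
    mapping: Dict[str, List[int]] = {c: [] for c in taxonomy}
    mapping[fallback] = []
    index = 0
    for batch in batched(titles, batch_size):
        labels = classify_batch(batch, taxonomy)
        if len(labels) != len(batch):
            raise ValueError("classifier must return one label per title")
        for label in labels:
            # Labels outside the taxonomy go to the fallback bucket.
            mapping[label if label in mapping else fallback].append(index)
            index += 1
    return mapping


def keyword_stub(batch: Sequence[str], taxonomy: Sequence[str]) -> List[str]:
    """Toy stand-in for the LLM: pick the category sharing the most words."""
    labels = []
    for title in batch:
        words = set(title.lower().split())
        scored = [(len(words & set(c.lower().split())), c) for c in taxonomy]
        best_score, best = max(scored, key=lambda pair: pair[0])
        labels.append(best if best_score > 0 else "")
    return labels


taxonomy = ["Computation Error", "Output Formatting Error", "Factual Error"]
titles = ["Arithmetic computation slip",
          "Wrong output formatting of JSON",
          "Hallucinated citation"]
mapping = assign_errors(titles, taxonomy, keyword_stub, batch_size=2)
```

Checking the label count per batch and routing unknown labels to a fallback bucket guards against a judge that skips items or invents categories.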

4. ErrorAtlas: Static Failure Taxonomy and Prevalence

Applying ErrorMap across 83 models and 35 datasets, with on the order of 7,000 sampled failure instances, led to the creation of ErrorAtlas, a fixed taxonomy that covers up to 95% of observed LLM failures.

| Category | Description |
| --- | --- |
| Logical Reasoning Error | Fails in inference or applying correct reasoning steps. |
| Missing Required Element | Omits mandatory sections, fields, or specified content. |
| Computation Error | Incorrect numerical, algebraic, or geometric results. |
| Incorrect Identification | Mislabels or misidentifies objects, concepts, or entities. |
| Specification Misinterpretation | Misunderstands task requirements or misformats outputs. |
| Output Formatting Error | Violates required structure, punctuation, case, or markup rules. |
| Irrelevant/Extraneous Content | Generates off-topic or unnecessary information. |
| Counting/Enumeration Error | Over- or under-counts, or omits cases in combinatorial steps. |
| Answer Selection Error | Maps a solution to the wrong multiple-choice option. |
| Incomplete Reasoning | Omits essential explanation, proof steps, or justification. |
| Factual Error | Supplies inaccurate or fabricated domain knowledge. |
| Tool/API Usage Error | Misuses or omits required tool or API calls. |
| Naming/Symbol Error | Uses incorrect symbols, variable names, or identifiers. |
| Inappropriate Refusal | Unjustifiably refuses to answer. |
| Unit Conversion Error | Incorrect unit or percentage conversions. |
| False Positive Detection | Flags errors or anomalies that do not exist. |
| Error Detection Failure | Fails to recognize existing mistakes (missed error). |

Prevalence statistics (e.g., "Missing Required Element" appears in 31/35 datasets and 82/83 models, prevalence 15.6%; "Logical Reasoning Error" is in 25 datasets, 56 models, prevalence 9.1%) demonstrate wide coverage and specificity (Ashury-Tahan et al., 22 Jan 2026).
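Statistics of this kind can be derived from the populated taxonomy with simple counting. The sketch below assumes that "prevalence" means a category's share of all categorized failures, which the source does not spell out; the record layout `(dataset, model, category)` and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Prevalence(NamedTuple):
    datasets: int  # number of distinct datasets where the category appears
    models: int    # number of distinct models where the category appears
    share: float   # assumed definition: fraction of all categorized failures


def category_prevalence(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Prevalence]:
    """Aggregate (dataset, model, category) records per category."""
    datasets = defaultdict(set)
    models = defaultdict(set)
    counts: Dict[str, int] = defaultdict(int)
    total = 0
    for dataset, model, category in records:
        datasets[category].add(dataset)
        models[category].add(model)
        counts[category] += 1
        total += 1
    return {
        c: Prevalence(len(datasets[c]), len(models[c]), counts[c] / total)
        for c in counts
    }


# Toy records, invented for illustration.
records = [
    ("d1", "m1", "Missing Required Element"),
    ("d1", "m2", "Missing Required Element"),
    ("d2", "m1", "Logical Reasoning Error"),
    ("d2", "m1", "Missing Required Element"),
]
stats = category_prevalence(records)
```

Reporting dataset and model coverage alongside the share separates categories that are widespread from those that are frequent only in a few settings.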

5. Representative Examples and Model Comparisons

Concrete applications elucidate the pipeline's interpretability and practical value:

  • Instance Diagnosis: For the OmniMath dataset and Gemini 2.0 Flash Lite, when asked "Solve [equation] in positive integers," the model listed only (1,10,10) with no completeness check. ErrorMap labeled this as "Incomplete solution set" and pinpointed partially-met criteria ("examine all feasible a", "list all permutations").
  • Failure Signature Comparisons: Figure 1 contrasts models (e.g., Claude 3.5 Haiku vs. Gemini 2.0 Flash Lite) by per-category failure fraction, such as high "Missing Required Element" for Gemini and amplified "Logical Reasoning Error" for Claude, illustrating model-specific weaknesses.

6. Generalizability and Deployment Practices

ErrorMap is agnostic to the underlying model architecture and dataset specifics. Its sole requirements are input–output pairs (optionally with ground truth and correct references), a judge LLM, and standard prompt templates. The pipeline does not demand token-level alignments or model white-box access.

  • Immediate Applicability: Any LLM or dataset, including emerging benchmarks, can be analyzed by running Stage 1 and Stage 2.b against ErrorAtlas to produce failure signatures.
  • Practical Deployment: For expediency, users often bypass Stage 2.a (new taxonomy induction), relying on ErrorAtlas for robust, static categorization. Running Stage 1 for error titles and Stage 2.b for assignment enables rapid, cost-efficient diagnosis without loss of diagnostic integrity.
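The shortened deployment path (Stage 1 followed by Stage 2.b against a fixed taxonomy) might be wired together as follows. This is a sketch under stated assumptions: `analyze` and `assign` are hypothetical callables standing in for the two judge-LLM prompts, the instance tuple layout is invented, and the toy stubs only imitate what an LLM would return.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

# Assumed layout: (input text, gold answer, prediction, task score).
Instance = Tuple[str, str, str, float]


def diagnose_with_fixed_taxonomy(
    instances: Sequence[Instance],
    tau: float,
    analyze: Callable[[str, str, str], str],      # Stage 1: returns an error title
    assign: Callable[[str, Sequence[str]], str],  # Stage 2.b: returns a category
    categories: Sequence[str],
) -> Dict[str, float]:
    """Return the per-category failure share, skipping taxonomy induction."""
    failures = [inst for inst in instances if inst[3] < tau]
    titles = [analyze(x, gold, pred) for x, gold, pred, _ in failures]
    labels = [assign(title, categories) for title in titles]
    counts = Counter(labels)
    n = len(labels)
    return {c: (counts[c] / n if n else 0.0) for c in categories}


# Toy stand-ins for the judge LLM, for illustration only.
def toy_analyze(x: str, gold: str, pred: str) -> str:
    return "Incomplete list" if len(pred) < len(gold) else "Wrong value"


def toy_assign(title: str, cats: Sequence[str]) -> str:
    return cats[1] if "Incomplete" in title else cats[0]


categories = ["Computation Error", "Missing Required Element"]
instances = [
    ("2+2?", "4", "5", 0.0),
    ("List three primes", "2, 3, 5", "2, 3", 0.0),
    ("Capital of France?", "Paris", "Paris", 1.0),
]
signature = diagnose_with_fixed_taxonomy(
    instances, 0.5, toy_analyze, toy_assign, categories
)
```

Passing the two judge steps in as callables keeps the sketch independent of any particular LLM client and makes each stage easy to replace or test in isolation.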

This approach yields both detailed, per-instance explanations and condensed, model-level diagnostics, advancing interpretability and comparability in LLM evaluation frameworks (Ashury-Tahan et al., 22 Jan 2026).

References (1)
