ErrorMap Pipeline: LLM Error Diagnosis
- ErrorMap pipeline is a systematic framework for diagnosing, categorizing, and quantifying failure modes in LLM outputs using a two-stage error analysis and taxonomy induction.
- It leverages LLM-based prompts to generate structured JSON diagnostics, enabling precise mapping of error signatures and robust cross-model comparisons.
- The framework underpins ErrorAtlas, a static taxonomy that captures up to 95% of observed errors, and supports rapid deployment across diverse datasets and models.
ErrorMap Pipeline
The ErrorMap pipeline is a structured framework for systematic diagnosis, categorization, and quantification of failure modes in LLM outputs. By translating raw, incorrect predictions into multi-tiered taxonomies and interpretable failure signatures, ErrorMap enables precise model debugging and cross-model comparability. The system underpins the construction of ErrorAtlas, a static, comprehensive taxonomy that exposes prevailing and underexplored LLM error types. ErrorMap operates purely on model outputs and ground-truth references, with no reliance on task-specific engineering or model internals, making it broadly applicable across LLMs and benchmarks (Ashury-Tahan et al., 22 Jan 2026).
1. Two-Stage Error Diagnosis and Taxonomy Induction
ErrorMap employs a two-stage procedure to extract structured diagnostics from model outputs and construct a robust, hierarchical taxonomy:
- Stage 1: Per-Instance Error Analysis
- Inputs: For each failed model–dataset instance, the system takes the input text, the gold answer (if available), a capped number of informative correct predictions (ICPs) from other models, and one erroneous prediction from the target model.
- Process: The input is formatted into a prompt for a designated grading LLM (e.g., gpt-oss-120b). The LLM produces a structured JSON analysis including:
  - `required_criteria`: a list of necessary reasoning steps or sub-tasks, with evidence-based checks for satisfaction.
  - `error_title`: a concise label identifying the first major error.
  - `error_summary`: a brief diagnosis of root cause.
- Stage 2: Taxonomy Hierarchy Building
- 2.a Category Generation: All free-form `error_title` strings from Stage 1 are clustered into broader categories using iterative LLM-in-the-loop clustering (generate, refine, review).
- 2.b Error Assignment: Each per-instance error label is mapped to a taxonomy category, creating a populated hierarchy spanning categories, sub-categories, and instance mappings.
This two-stage pipeline yields a per-run, multi-level error taxonomy that can be held static for consistent future application or evolved with new data and models.
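As a rough illustration of the Stage 1 output, the sketch below parses and validates one judge response. It assumes the top-level keys `required_criteria`, `error_title`, and `error_summary` named above, plus the per-criterion fields listed for Algorithm 1 in Section 3. The class names, the field types, and the `parse_diagnosis` helper are hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    criterion: str  # required reasoning step or sub-task
    present: bool   # whether the prediction addresses it (bool is an assumption)
    quality: str    # judge's rating of how well it was met (free-form here)
    evidence: str   # supporting excerpt from the prediction
    comment: str    # judge's note


@dataclass
class Diagnosis:
    required_criteria: List[Criterion]
    error_title: str
    error_summary: str


def parse_diagnosis(raw: str) -> Diagnosis:
    """Parse one Stage 1 JSON response and check that the expected keys exist."""
    data = json.loads(raw)
    expected = {"required_criteria", "error_title", "error_summary"}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"judge output is missing fields: {sorted(missing)}")
    criteria = [Criterion(**c) for c in data["required_criteria"]]
    return Diagnosis(criteria, data["error_title"], data["error_summary"])


# Illustrative response; the values are made up for this example.
sample = json.dumps({
    "required_criteria": [{
        "criterion": "examine all feasible a",
        "present": True,
        "quality": "partial",
        "evidence": "only one case is checked",
        "comment": "no completeness argument",
    }],
    "error_title": "Incomplete solution set",
    "error_summary": "The answer lists one solution without checking the rest.",
})
diagnosis = parse_diagnosis(sample)
print(diagnosis.error_title)
```

Validating the structure before Stage 2 keeps malformed judge outputs from silently entering the clustering step.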
2. Formal Metrics and Failure Signature Representation
The ErrorMap formalism is grounded in clear mathematical definitions and metrics.
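- Failure Instance Detection: For dataset $D$ and model $M$, the task-specific score $s_M(x_i)$ of each instance $x_i \in D$ is compared to a threshold $\tau$. The failure indicator is defined as:
$$f_M(x_i) = \mathbb{1}\left[\, s_M(x_i) < \tau \,\right]$$
- Taxonomy Construction proceeds via an assignment $a: \mathcal{E} \to \{1, \dots, K\}$ mapping free-form error labels to one of $K$ categories $c_1, \dots, c_K$.
- Failure signature vector (for model $M$):
Let $n_k^{(M)} = \left|\{\, i : f_M(x_i) = 1,\ a(e_i) = k \,\}\right|$, where $e_i$ is the error label of instance $x_i$. The total number of failures is $N^{(M)} = \sum_{k=1}^{K} n_k^{(M)}$. The normalized failure signature $\mathbf{p}^{(M)}$ is given by:
$$\mathbf{p}^{(M)} = \left( \frac{n_1^{(M)}}{N^{(M)}}, \dots, \frac{n_K^{(M)}}{N^{(M)}} \right)$$
which is a model-specific vector quantifying the relative prevalence of failure categories.
- Aggregate and Comparative Metrics:
- Overall error rate $N^{(M)} / |D|$.
- Model-to-model failure signature distance $d\left(\mathbf{p}^{(M)}, \mathbf{p}^{(M')}\right)$ between the normalized signatures of two models $M$ and $M'$, under one of two alternative distance measures.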
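The signature computation can be sketched in a few lines: count each category's failures for a model and divide by that model's total failures. The function names are hypothetical. The L1 distance is used only as one plausible way to compare two signatures and is not claimed to be the measure used in the paper. The toy labels below are invented for illustration, and every label is assumed to belong to the category list.

```python
from collections import Counter
from typing import List, Sequence


def failure_signature(assigned: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Share of a model's failures that fall in each category, in category order."""
    total = len(assigned)
    if total == 0:
        return [0.0] * len(categories)  # no failures: all-zero signature
    counts = Counter(assigned)
    return [counts[c] / total for c in categories]


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute per-category differences (an illustrative distance choice)."""
    return sum(abs(a - b) for a, b in zip(p, q))


categories = ["Logical Reasoning Error", "Missing Required Element", "Computation Error"]

# One assigned category per failed instance, per model (toy data).
model_a = ["Logical Reasoning Error", "Logical Reasoning Error",
           "Missing Required Element", "Computation Error"]
model_b = ["Logical Reasoning Error", "Missing Required Element",
           "Missing Required Element", "Missing Required Element"]

sig_a = failure_signature(model_a, categories)  # [0.5, 0.25, 0.25]
sig_b = failure_signature(model_b, categories)  # [0.25, 0.75, 0.0]
print(l1_distance(sig_a, sig_b))                # 1.0
```

Because each signature is normalized by the model's own failure count, two models with very different overall error rates can still be compared on the shape of their failures.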
3. Core Algorithms in ErrorMap
The principal algorithms are implemented as LLM-centric pipelines with minimal task-specific intervention.
- Per-Instance Error Analysis (Algorithm 1): Prompts the LLM with instance data to output a diagnostic JSON containing the required criteria (each with fields `criterion`, `present`, `quality`, `evidence`, `comment`), `error_title`, and `error_summary`.
- Taxonomy Generation (Algorithm 2): Runs a clustering LLM prompt for category generation, refines the categories over batches, and finalizes the taxonomy with a review pass.
- Error Assignment (Algorithm 3): Batch classifies each error title against the taxonomy, populates category→instance mappings.
- Failure Signature Computation (Algorithm 4): Canonically computes the normalized failure signature vector for each model given the error-label/category mappings.
All algorithms leverage LLM-based reasoning for both the extraction of fine-grained, context-dependent error types and the abstraction needed for robust clustering.
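The batched assignment step (Algorithm 3) might look roughly like the sketch below. The judge is modeled as a plain callable so the control flow can be run without an LLM; in practice it would wrap a prompt to the judge model. The `Judge` signature, the `batch_size` default, and the `"Other"` fallback bucket are assumptions made for this example, not details from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical judge interface: given a batch of error titles and the category
# names, return one category name per title, in the same order.
Judge = Callable[[Sequence[str], Sequence[str]], List[str]]


def assign_errors(
    titles: Sequence[str],
    categories: Sequence[str],
    judge: Judge,
    batch_size: int = 20,
    fallback: str = "Other",
) -> Dict[str, List[int]]:
    """Map each category to the indices of the error titles assigned to it."""
    mapping: Dict[str, List[int]] = {c: [] for c in categories}
    mapping[fallback] = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        labels = judge(batch, categories)
        if len(labels) != len(batch):
            raise ValueError("judge must return exactly one label per title")
        for offset, label in enumerate(labels):
            # Labels outside the taxonomy go to the fallback bucket.
            key = label if label in mapping else fallback
            mapping[key].append(start + offset)
    return mapping


def keyword_judge(batch: Sequence[str], categories: Sequence[str]) -> List[str]:
    """Stand-in for an LLM judge: a trivial keyword rule, for demonstration only."""
    return ["Computation Error" if "arithmetic" in t.lower() else "Unrecognized"
            for t in batch]


titles = ["Arithmetic slip in step 2", "Dropped the JSON field", "Wrong arithmetic sign"]
result = assign_errors(
    titles, ["Computation Error", "Missing Required Element"], keyword_judge, batch_size=2
)
print(result)
```

Keeping instance indices rather than titles in the mapping makes it straightforward to trace each category back to the original failed instances.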
4. ErrorAtlas: Static Failure Taxonomy and Prevalence
Applying ErrorMap across 83 models and 35 datasets, with 17,000 sampled failure instances, produced ErrorAtlas, a fixed taxonomy that covers up to 95% of observed LLM failures.
| Category | Description |
|---|---|
| Logical Reasoning Error | Fails in inference or applying correct reasoning steps. |
| Missing Required Element | Omits mandatory sections, fields, or specified content. |
| Computation Error | Incorrect numerical, algebraic, or geometric results. |
| Incorrect Identification | Mislabels or misidentifies objects, concepts, or entities. |
| Specification Misinterpretation | Misunderstands task requirements or misformats outputs. |
| Output Formatting Error | Violates required structure, punctuation, case, or markup rules. |
| Irrelevant/Extraneous Content | Generates off-topic or unnecessary information. |
| Counting/Enumeration Error | Over-/under-counts, omits cases in combinatorial steps. |
| Answer Selection Error | Maps a solution to the wrong multiple-choice option. |
| Incomplete Reasoning | Omits essential explanation, proof steps, or justification. |
| Factual Error | Supplies inaccurate or fabricated domain knowledge. |
| Tool/API Usage Error | Misuses or omits required tool or API calls. |
| Naming/Symbol Error | Uses incorrect symbols, variable names, or identifiers. |
| Inappropriate Refusal | Unjustifiably refuses to answer. |
| Unit Conversion Error | Incorrect unit or percentage conversions. |
| False Positive Detection | Flags errors or anomalies that do not exist. |
| Error Detection Failure | Fails to recognize existing mistakes (missed error). |
Prevalence statistics (e.g., "Missing Required Element" appears in 31/35 datasets and 82/83 models, prevalence 15.6%; "Logical Reasoning Error" is in 25 datasets, 56 models, prevalence 9.1%) demonstrate wide coverage and specificity (Ashury-Tahan et al., 22 Jan 2026).
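Statistics of this kind can be derived from the category assignments. The sketch below counts, per category, the distinct datasets and models in which it appears. It treats prevalence as the category's share of all sampled failures, which is an assumption, since the text above does not define the term. The `Failure` record and `category_stats` function are hypothetical names, and the records are toy data.

```python
from typing import Dict, Iterable, NamedTuple


class Failure(NamedTuple):
    dataset: str
    model: str
    category: str


def category_stats(failures: Iterable[Failure]) -> Dict[str, dict]:
    """Per category: number of datasets, number of models, share of all failures."""
    records = list(failures)
    total = len(records)
    seen: Dict[str, dict] = {}
    for f in records:
        entry = seen.setdefault(
            f.category, {"datasets": set(), "models": set(), "count": 0}
        )
        entry["datasets"].add(f.dataset)
        entry["models"].add(f.model)
        entry["count"] += 1
    return {
        cat: {
            "n_datasets": len(e["datasets"]),
            "n_models": len(e["models"]),
            "prevalence": e["count"] / total,  # assumed definition of prevalence
        }
        for cat, e in seen.items()
    }


toy = [
    Failure("dataset_1", "model_x", "Missing Required Element"),
    Failure("dataset_2", "model_x", "Missing Required Element"),
    Failure("dataset_2", "model_y", "Missing Required Element"),
    Failure("dataset_1", "model_y", "Computation Error"),
]
stats = category_stats(toy)
print(stats["Missing Required Element"])
```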
5. Representative Examples and Model Comparisons
Concrete applications elucidate the pipeline's interpretability and practical value:
- Instance Diagnosis: For the OmniMath dataset and Gemini 2.0 Flash Lite, when asked to solve an equation in positive integers, the model listed only (1,10,10) with no completeness check. ErrorMap labeled this as "Incomplete solution set" and pinpointed the partially met criteria ("examine all feasible a", "list all permutations").
- Failure Signature Comparisons: Figure 1 contrasts models (e.g., Claude 3.5 Haiku vs. Gemini 2.0 Flash Lite) by per-category failure fraction, such as high "Missing Required Element" for Gemini and amplified "Logical Reasoning Error" for Claude, illustrating model-specific weaknesses.
6. Generalizability and Deployment Practices
ErrorMap is agnostic to the underlying model architecture and dataset specifics. Its sole requirements are input–output pairs (optionally with ground truth and correct references), a judge LLM, and standard prompt templates. The pipeline does not demand token-level alignments or model white-box access.
- Immediate Applicability: Any LLM or dataset, including emergent benchmarks, can be analyzed by running Stage 1 and Stage 2.b against ErrorAtlas to produce failure signatures.
- Practical Deployment: For expediency, users often bypass Stage 2.a (new taxonomy induction), relying on ErrorAtlas for robust, static categorization. Running Stage 1 for error titles and Stage 2.b for assignment enables rapid, cost-efficient diagnosis without loss of diagnostic integrity.
This approach yields both detailed, per-instance explanations and condensed, model-level diagnostics, advancing interpretability and comparability in LLM evaluation frameworks (Ashury-Tahan et al., 22 Jan 2026).