- The paper introduces a formal taxonomy that categorizes 35 distinct data error types in relational databases.
- It distinguishes between missing, incorrect, and redundant errors, clarifying their granular manifestations and detection challenges.
- The catalog underpins advanced error detection and cleaning strategies, enhancing data quality and AI/ML performance.
Introduction
"A Catalog of Data Errors" (2604.09277) delivers a comprehensive formalization and taxonomy of data errors and error indicators specific to relational databases (RDBs). The paper rigorously identifies 35 distinct error types, provides their formal definitions, clarifies their manifestations and granularity, and delivers unified terminology. This catalog addresses both classical data errors and complex statistical error indicators particularly relevant in AI/ML contexts. The systematic approach enables refined error detection and cleaning, and aligns with a growing need for robust data hygiene underlying machine learning systems and analytics.
The work introduces a precise formalism to differentiate between the representations of real-world entities and their database encodings. This is operationalized through a mapping function that explicitly relates database values or tuples to their real-world counterparts. Errors are classified according to their manifestation into three mutually exclusive categories: missing data, incorrect data, and redundant data. Error definitions are parameterized by granularity (value, tuple, attribute, table, or database) and context (syntactic/semantic).
The paper further addresses terminological inconsistencies in prior taxonomies and resolves ambiguity by decoupling data errors (objectively erroneous with respect to ground truth) from error indicators (statistical or logic-based patterns requiring further judgment). The framework applies across all tabular relational data, leveraging an entity-relationship example schema (Figure 1).
Figure 1: Entity-relationship diagram for the running example: Employment database.
Comprehensive Catalog of Error Types
Missing Data
Errors in the missing data category include:
- Explicit Missing Values (NULLs): Violate NOT NULL constraints or cause missing facts even in absence of hard constraints.
- Disguised Missing Values: Syntactically valid placeholders (“Unknown”, -99) that semantically denote missingness—challenging automated detection.
- Partial-Empty Tuples/Attributes: Tuples or attributes exceeding a threshold of missing values; relevant in wide tables and practical cleaning heuristics.
- Missing Tuples: Entire fact/row absence, detectable only against external real-world knowledge or expectations.
- Empty Attribute Columns: All entries missing in an attribute, often symptomatic of schema drift.
- Bias: Treated as an error indicator—systematic under- or over-representation of classes/groups causing invalid downstream statistical inference.
Incorrect Data
Incorrect data contains the broadest and most diverse set of error types. Highlights include:
- Invalid Values/Tuples: Violating basic domain, type, or semantic constraints.
- Textual Errors: Misspellings, typos, misscans (OCR errors), and incorrect encoding—particularly problematic in string attributes, names, and codes.
- Out-of-Vocabulary (OOV) Words: Values or tokens not present in allowed vocabularies (critical for NLP and ML applications).
- Synonyms and Word Transpositions: Semantically related or sequence-altered values undermining deduplication and integration.
- Noise: Unintended stochastic deviation, e.g., from sensors or human entry.
- Semantic Ambiguity: Attributes/tuples mapping to multiple real-world entities (homonyms, vague abbreviations).
- Outliers: Statistically anomalous values—flagged as error indicators.
- Syntax and Formatting Violations: Deviation from expected grammar or heterogeneous representations (e.g., different date formats).
- Incorrect Units: Mismatched units; impactful for downstream aggregation or ML if undetected.
- Incorrect/Invalid Reference: Syntactically “valid” foreign keys referencing the wrong external entities.
Rule Violations
- Constraint Violations: Classic integrity violations—domain, uniqueness, referential, or composite.
- (Conditional) Functional Dependency Violations: FD/CFD errors, particularly relevant in normalization and data integration.
- Cyclic Dependencies: Violations arising from schema misdesign, e.g., self-referential cycles.
- Business, Legal, DBA Rule Violations: Extra-logical (often metadata- or process-defined) rules not captured by standard DBMS constraints.
- Outdated Data: Data values that no longer reflect the current real-world state, a pervasive issue in transactional and time-dependent domains.
Redundant Data
Redundancy goes beyond duplicating information:
- Duplicate Tuples: Lexical or semantic duplicates, including fuzzy/approximate variants, hampering entity resolution and record linkage.
- Irrelevant Data: Entries that do not correspond to any legitimate real-world instance as defined by the database’s business requirements.
The catalog distinguishes between data-level errors and metadata errors, which affect schema elements (relation labels, attribute names, constraints). Metadata errors can either directly induce or mask data errors, thereby compounding detection and remediation complexity.
Moreover, not all problematic data characteristics are errors in a formal sense; some (e.g., irretrievability or aggregation difficulty) are domain- or system-related and require domain expertise to assess relevance or impact. The catalog deliberately separates such issues from concrete, directly detectable data errors.
Implications for Data Quality, Detection, and Cleaning
By formally categorizing errors, the catalog serves as an underpinning for practitioners and researchers targeting error detection and cleaning:
- Error-Specific Detection: Enables alignment of validation and profiling tools with error formalism, increasing coverage and interpretability.
- Error Correction Strategies: Informs decisions for imputation, deduplication, and constraint repair by allowing error-type-aware interventions.
- Tool/Benchmark Development: Standardizes ground-truth annotation, evaluation, and benchmarking of data cleaning and repair tools.
- AI/ML Readiness: Highlights error indicators particularly problematic for statistical inference (e.g., bias, outliers, DMVs, OOV). The catalog thus links classical DB community concerns with ML practice, where undetected data errors subtly degrade model validity, generalization, and fairness [Sedir2024].
The paper also discusses available detection and cleaning tools; modern solutions (e.g., LLM-based error detection) are highlighted as future directions [Zhang_2025].
Theoretical and Practical Impact
The main theoretical impact is rigorous, unambiguous formalization at the data level—enabling researchers to:
- Systematize evaluation of DQ interventions.
- Build fair and reproducible benchmarks.
- Explore relationships between error types, metadata errors, and high-level DQ dimensions (completeness, consistency, etc.).
- Investigate modality-transfer (e.g., how error types manifest in images/time series).
Practically, the catalog is positioned as a reference for data pipeline designers, data scientists, and enterprise DBAs. It is domain-agnostic for any tabular data, supporting the implementation of schema constraints, profiling, and type-aware cleaning. It provides a base for future work in extending error types to non-tabular modalities and for building automated error-type classification systems.
Future Directions
- Automated Error-Type Classification: The catalog can power supervised or LLM-based systems for fine-grained detection, a critical step for scalable and differentiated error remediation.
- Formalization of Metadata Errors: Needed to propagate DQ improvements from metadata integration and harmonization.
- Extension to Non-Tabular Modalities: Graph, sequence, and unstructured text require analogous, possibly modality-tailored error frameworks.
- Stronger Mapping to DQ Dimensions: Closer links to assessment and benchmarking in real-world analytic and ML scenarios.
Conclusion
This work provides a methodical and formal foundation for understanding and addressing the diverse universe of data-level errors in RDBs. The catalog's granularity and explicitness resolve outstanding ambiguities in terminology and taxonomy, enabling more robust, interpretable, and efficient data cleaning strategies—an essential foundation for trustworthy analytics and machine learning.
Reference:
- "A Catalog of Data Errors" (2604.09277)