
Data Excellence for AI: Why Should You Care

Published 19 Nov 2021 in cs.LG and cs.AI (arXiv:2111.10391v1)

Abstract: The efficacy of ML models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which their empirical progress is measured. Benchmark datasets define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the algorithmic aspect of the models rather than critiquing and improving the data with which our models operate. If "data is the new oil," we are still missing work on the refineries by which the data itself could be optimized for more effective use.

Citations (23)

Summary

  • The paper demonstrates that data quality is the critical factor in AI performance, challenging the convention that algorithm improvements alone drive success.
  • The paper outlines key challenges in measuring data fidelity, focusing on issues such as bias, fairness, and reproducibility in datasets.
  • The paper advocates for adopting software engineering practices, like rigorous testing and living dataset management, to ensure data reliability and transparency.

Importance of Data Excellence in Artificial Intelligence

The paper "Data Excellence for AI: Why Should You Care" (arXiv:2111.10391) examines the pivotal role of data quality in advancing the field of AI and ML. It argues that while algorithmic advancements have witnessed extensive research and development, the equally crucial role of data has not been given the same attention, often relegated to mere preprocessing work perceived as uninspiring. This essay synthesizes the paper's insights into data quality challenges, the implications for AI systems, and strategies for fostering data excellence.

The Overlooked Importance of Data

The paper emphasizes that the quality of training datasets fundamentally determines the effectiveness of AI models. Despite well-established practices for algorithmic development, parallel frameworks for data, including rigorous testing and validation standards, are lacking. The common perception of data tasks as tedious or secondary has led to significant issues, with real-world datasets often being "dirty" and plagued by quality problems. The concept of "garbage in, garbage out" highlights that poor data quality directly translates into suboptimal model performance. The authors stress that data quality assessment should not rely merely on model performance metrics but requires dedicated validation processes akin to those in software engineering.

Challenges in Measuring Data Quality

Data quality is complex and frequently misunderstood. The paper identifies the challenges of measuring dataset quality, pointing out issues related to fairness, bias, lack of documentation, and reproducibility concerns. Standard metrics like F1 score, accuracy, and AUC measure how well a model fits a dataset, not the fidelity or validity of the data itself, and no standardized measure of "goodness-of-data" yet exists. This gap can lead to ML models that fail to generalize to real-world data due to missing or unrepresentative instances in benchmark datasets.

Lessons from Software Engineering

The authors draw an analogy between software engineering and data management, suggesting that the rigorous processes for ensuring software quality could inspire similar approaches for data quality. For example, just as systematic code reviews and debugging are integral to software development, analogous practices could ensure data reliability and reproducibility. The workshop discussions also proposed that best practices and infrastructures for maintaining "living" datasets could mitigate risks associated with static or outdated data repositories.
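The software-engineering analogy can be made concrete with a "unit test" for a labeled dataset, checking both the format of records and the meaning of their labels. The sketch below is illustrative only: the record layout, the `ALLOWED_LABELS` set, and the function names are assumptions for the example, not part of the paper.

```python
# A minimal sketch of "unit tests for data", by analogy with software unit
# tests. The schema and label set below are hypothetical.

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed label schema

def check_record(record: dict) -> list[str]:
    """Return a list of data-quality failures for one labeled example."""
    failures = []
    # Syntactic checks: required fields exist and have the right type.
    for field, ftype in (("text", str), ("label", str)):
        if not isinstance(record.get(field), ftype):
            failures.append(f"missing or mistyped field: {field}")
    # Semantic checks: the *meaning* of the values, not just their shape.
    if record.get("label") not in ALLOWED_LABELS:
        failures.append(f"unknown label: {record.get('label')!r}")
    if isinstance(record.get("text"), str) and not record["text"].strip():
        failures.append("empty text")
    return failures

def validate_dataset(records: list[dict]) -> dict:
    """Aggregate failures across a dataset, like a test-suite report."""
    report = {}
    for i, rec in enumerate(records):
        failures = check_record(rec)
        if failures:
            report[i] = failures
    return report
```

Run as part of a data pipeline, a report like this plays the role a failing test suite plays in software: it blocks a defective dataset version from shipping.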

Framework for Data Excellence

The workshop outlined in the paper highlighted several properties critical for achieving data excellence: maintainability, reliability, validity, and fidelity. These include ensuring datasets accurately represent the phenomenon of interest and remain consistent and reusable over time. Moreover, the authors advocate for documenting datasets extensively to expose biases and enhance the transparency of data pipelines. Through case studies and expert presentations, they underscore the necessity of viewing datasets as critical infrastructure that require as much attention as the algorithmic models they inform.

Future Directions and Conclusion

The implications of data excellence extend beyond improved model performance to ethical and societal impacts. As AI systems become more prevalent in high-stakes environments, the potential for "data catastrophes" arising from poor-quality datasets increases. Developing comprehensive frameworks and rigorously evaluating the data underpinning these systems are essential steps toward mitigating such risks. Looking forward, the paper advocates for a cultural shift within the AI community to prioritize data excellence, thereby accelerating the breadth and depth of AI research and deployment.

In conclusion, the paper "Data Excellence for AI: Why Should You Care" emphasizes a shift in focus toward the quality and management of datasets used in AI and ML, advocating for practices inspired by established software engineering disciplines. With appropriate frameworks and industry collaboration, the next phase of AI advancements can be grounded in data that is accurate, reliable, and genuinely reflective of the complexities of real-world phenomena.


Explain it Like I'm 14

What is this paper about?

This paper argues that in AI, the data we use is just as important as the algorithms (the “smart” parts) we build. Think of AI like a car: the algorithm is the engine, but the data is both the fuel and the map. If the fuel is dirty or the map is wrong, even the best engine won’t get you safely where you want to go. The authors explain why we need “data excellence” and share lessons from a workshop that brought experts together to discuss how to improve the quality of data used in AI.

What questions did the authors ask?

To make the ideas easy to grasp, here are the main questions the paper explores:

  • Why is data quality often ignored, and why does that hurt AI systems?
  • How can we measure whether a dataset truly represents the real world (not just whether a model scores well on a test)?
  • What properties make a dataset “excellent,” and how can we build and maintain such datasets?
  • What can AI learn from the way software engineering focuses on quality and safety?
  • What best practices, tools, and habits should the AI community adopt to avoid “data disasters”?

How did they study it?

Instead of running one experiment, the authors organized the 1st Data Excellence Workshop at a research conference (HCOMP 2020). They:

  • Invited speakers from industry, academia, and government to share case studies and lessons learned about data problems and successes.
  • Discussed and compared experiences to find common patterns and build a framework for “data excellence.”
  • Reviewed three contributed research papers that examined issues like labeling bias, supporting diverse opinions in training data, and improving data practices in high‑stakes AI (like medicine).

You can think of this workshop like a “team huddle” for the AI community, where people brought examples of what goes wrong with data and how to fix it, then drafted a playbook everyone can use.

What did they find, and why is it important?

The workshop surfaced several important ideas. To keep it clear, here are the highlights explained with everyday examples:

The problem: We care too much about scores and too little about the test itself

  • AI success today is often measured with numbers like Accuracy, F1, or AUC. These tell us how well a model fits a dataset (like how well you did on a practice test).
  • But these numbers don’t tell us whether the dataset itself is a good reflection of real life. It’s like getting an A on an easy or unrealistic practice test—the score looks great, but you may still fail in the real world.

Real-world data is messy—and that matters

  • Many datasets have “dirty” or biased data. If the input is wrong (“garbage in”), the output is wrong (“garbage out”).
  • Benchmark datasets (common “practice tests” for AI) often remove disagreement or ambiguity among human labelers. That makes the test cleaner but less realistic, so models can appear better than they actually are and then fail on real, tricky cases.

A framework for data excellence: four pillars

To build better datasets, the authors propose focusing on four key properties:

  • Maintainability: Keeping data healthy over time. Like maintaining a garden—data needs regular care, updates, and fixes as the world changes.
  • Validity: Measuring the right thing. Like a bathroom scale—it should measure weight, not height. A valid dataset actually captures the real concept you care about.
  • Reliability: Consistency and reproducibility. Like a thermometer that gives the same result for the same temperature each time. Others should be able to collect similar data and get similar results.
  • Fidelity: How close the dataset is to reality. Like a detailed map that accurately shows roads, traffic, and obstacles. Sampling mistakes (e.g., mixing the same user’s data into both training and testing) can reduce fidelity and mislead models.
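The fidelity pillar above can be checked mechanically. A minimal sketch, assuming each record carries a `user_id` field (a hypothetical layout), that detects the user-leakage mistake described above and performs a group-aware split instead:

```python
# A minimal sketch of a fidelity check: detecting "user leakage", where the
# same user's examples appear in both the training and test splits.

def leaked_users(train: list[dict], test: list[dict]) -> set:
    """Return the set of user ids present in both splits."""
    train_users = {r["user_id"] for r in train}
    test_users = {r["user_id"] for r in test}
    return train_users & test_users

def split_by_user(records: list[dict], test_users: set) -> tuple:
    """Group-aware split: assign whole users to one side, never both."""
    train = [r for r in records if r["user_id"] not in test_users]
    test = [r for r in records if r["user_id"] in test_users]
    return train, test
```

The design choice is to split on the grouping unit (here, users) rather than on individual examples, so the test set measures generalization to unseen users rather than memorization of seen ones.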

Lessons from software engineering

  • Software matured by investing in code reviews, testing, documentation, and safe release processes. AI needs the same—but for data.
  • The community should adopt standards, “unit tests” for datasets (checking both format and meaning of labels), and transparent documentation so others can understand, reuse, and improve datasets.

Examples from the talks and papers

  • Chatbots trained on unlabeled internet text may learn toxic behavior—data quality and safety checks are essential.
  • Medical data labeling is complex: it often requires expert graders, careful guidelines, and managers to maintain quality.
  • “Annotation artifacts” happen when labelers’ habits unintentionally bias the data (like a person who clicks the same answer too often). Better workflows can reduce these artifacts.
  • Training data should support diverse opinions when the “right” answer isn’t single and clear, so AI serves more users fairly.
  • Datasets should be treated like engineered infrastructure: plan them, document them, and hold people accountable for quality.

What’s the potential impact?

If the AI community takes data excellence seriously, we can expect:

  • Safer, fairer, and more reliable AI systems in high‑stakes areas like health, finance, and public services.
  • Faster progress that isn’t just “better scores on benchmarks,” but real‑world improvements.
  • Fewer “data disasters,” where models break because the dataset was flawed or outdated.
  • A cultural shift: valuing data work (discovery, cleaning, documenting, maintaining) as much as model work, with new metrics that judge how well datasets reflect reality.

In short, this paper is a call to action: treat data with the same care and rigor as we treat code. Build, test, document, and maintain datasets so AI can be trustworthy and truly useful in the real world.

Knowledge Gaps

Below is a single, concise list of knowledge gaps, limitations, and open questions that remain unresolved in the paper. Each item is concrete and intended to guide future research.

  • No standardized, validated metrics for “goodness-of-data” across validity, reliability, fidelity, and maintainability; need operational definitions, measurement protocols, and benchmark tasks to compare datasets.
  • Lack of formal data validation test suites analogous to software unit/integration tests (e.g., automated checks for label semantics, sampling bias, leakage, annotator disagreement handling).
  • Absence of methods to represent, retain, and leverage annotator disagreement, ambiguity, and multi-valence in both training and evaluation (beyond collapsing to a single “ground truth”).
  • Missing specification languages/processes for data requirements that link constructs to sampling frames, labeling schemas, target distributions, quality thresholds, and reproducibility criteria.
  • Limited infrastructure for treating datasets as living artifacts (versioning, dependency management, continuous data integration, change logs, and data regression testing).
  • No robust monitoring and correction strategies to prevent distribution drift in model-in-the-loop or continuously updated datasets (including alarm thresholds and corrective sampling policies).
  • Inadequate quantitative cost–benefit models for investments in data excellence (quality vs speed vs cost), including control levers and ROI estimation under real constraints.
  • Unclear accountability, roles, and incentive structures for dataset development and maintenance within organizations and communities; need governance models and role definitions.
  • No shared taxonomy and repository of data defects, failure modes, and “data catastrophes,” along with standardized evaluation tasks for data-quality interventions.
  • Weak reproducibility standards for data collection pipelines; need replication packages detailing sampling seeds, recruitment strategies, labeling interfaces and instructions, workforce management, and environment configurations.
  • Insufficient methods to measure dataset fidelity to real-world phenomena (external validity checks, coverage analyses, temporal/user leakage detection, and provenance auditing).
  • Need practical, human-centered methodologies to quantify and improve annotation reliability beyond inter-annotator agreement (accounting for training, fatigue, incentives, and expertise).
  • Limited strategies to reduce annotation artifacts while preserving task signal (workflow design, interface constraints, prompt phrasing), plus automated detection and remediation tools.
  • Unclear approaches to synergize human judgment with low-shot/unsupervised/active learning during annotation while maintaining quality guarantees and auditability.
  • No consensus on minimum viable dataset documentation and bias reporting (what is “enough” under deadlines, templates for bias discussion, maintenance and deprecation plans).
  • Missing community processes for “relentless introspection” of datasets (review norms, auditing cycles, acceptance criteria for dataset papers, and community-of-use stewardship).
  • Benchmarks rarely incorporate real-world ambiguity and hard cases; need protocols to include disagreement distributions, adversarial/rare events, and to report slice-wise performance and weak spots.
  • Lack of task-grounded evaluation protocols that measure model performance against real-world outcomes rather than benchmark proxies; need linkage to end-user impact metrics.
  • Few automated tools to detect and prevent data leakage (across time, users, and sources) and enforce proper splits; need provenance tracking and leakage tests integrated in pipelines.
  • Underdeveloped practices for collecting and curating toxic or sensitive content (e.g., dialogue) that balance coverage, harm mitigation, annotator well-being, and ethical compliance.
  • Domain-specific guidelines and checklists (e.g., medical labeling) are not formalized or validated; need structured roles (experts, workforce managers), training curricula, and effectiveness assessments.
  • Limited methods for privacy-preserving data excellence that quantify how privacy mechanisms (e.g., differential privacy) affect validity, fidelity, and downstream utility.
  • Scalability of data maintenance is unaddressed (staffing models, automation, KPIs for data health, incident response, and budgeting for long-term upkeep).
  • No methodology to disentangle contributions of data improvements versus model changes; need experimental designs and reporting standards to attribute performance gains to data excellence.
  • Insufficient practices to ensure dataset reusability across applications (modular schemas, intended-use statements, composability, licensing clarity, and interoperability standards).
  • Lack of standardized reporting for dataset pipeline architectures as engineered infrastructure (diagrams, test coverage, monitoring, failure handling, and change management).
  • Few comparative evaluations of existing datasets against the proposed properties (maintainability, reliability, validity, fidelity) to empirically validate the data excellence framework.
  • Limited mechanisms for proactively discovering “unknown unknowns” via systematic probing, targeted adversarial data collection, coverage metrics, and exploratory sampling.
  • Incomplete stakeholder analysis for data fidelity concerns (who cares, how requirements differ across stakeholders, and how to reconcile competing priorities).
  • Weak funding and publication incentives for data excellence work; need review criteria, recognition mechanisms, and institutional policies that reward dataset quality and maintenance.

Practical Applications

The following lists synthesize actionable, real-world applications of “Data Excellence for AI” across industry, academia, policy, and daily life. Each item denotes sector(s), suggested tools/products/workflows, and key assumptions or dependencies.

Immediate Applications

These can be piloted or deployed with current practices and tooling.

  • Data unit tests for labels and datasets (software, healthcare, education)
    • Tools/Workflows: Create “dataset unit tests” analogous to software tests (syntax checks for schema conformity; semantic checks for label meaning/magnitude; cross-split leakage checks).
    • Assumptions/Dependencies: Clear label schema, domain guidelines, and CI/CD integration for data.
  • Disagreement-aware dataset construction and evaluation (NLP, recommender systems, content moderation)
    • Tools/Workflows: Preserve annotator disagreement instead of collapsing to a single “ground truth”; train and evaluate with uncertainty-aware labels; use disagreement dashboards.
    • Assumptions/Dependencies: Labeling platforms support multi-annotator storage; models can consume soft labels or distributions.
  • Adversarial data collection to uncover model “weak spots” (software, robotics, chatbots)
    • Tools/Workflows: Red-teaming datasets; targeted sampling of edge cases; model-in-the-loop data acquisition that explicitly seeks failure modes.
    • Assumptions/Dependencies: Budget for ongoing data collection; evaluation harnesses and safety guardrails.
  • Data fidelity checks for sampling and split design (healthcare time series, energy forecasting, finance, robotics)
    • Tools/Workflows: Enforce temporal splits to avoid leakage; separate user-based data across train/test; audit representativeness by geography, demographics, device, and time.
    • Assumptions/Dependencies: Reliable timestamps, user/device identifiers, and access to stratification attributes.
  • Operational data documentation (all sectors)
    • Tools/Workflows: Lightweight dataset design documents that capture scope, collection process, known biases, maintenance plan, and intended use; versioned documentation as the dataset evolves.
    • Assumptions/Dependencies: Team time allocation; governance requirements; privacy reviews.
  • Data versioning and lineage for reproducibility (finance, healthcare, regulated industries)
    • Tools/Workflows: Data catalogs; dataset “Git” with immutable snapshots, lineage graphs, and reproducible pipelines; tie experiment artifacts to dataset versions.
    • Assumptions/Dependencies: Storage/infrastructure; standardized metadata; organizational buy-in.
  • Expert labeling operations for high-stakes domains (healthcare, legal, safety-critical manufacturing)
    • Tools/Workflows: Workforce managers to coordinate experts; graded guidelines with training; adjudication protocols; audit trails of decision rationale.
    • Assumptions/Dependencies: Access to domain experts; secure environments; compensation and training budgets.
  • Toxicity filtering and curation for dialogue corpora (chatbots, customer support)
    • Tools/Workflows: Pre-filter training corpora for offensive content; continual re-evaluation with “relentless introspection”; user-safety policies.
    • Assumptions/Dependencies: High-quality toxicity detectors and multilingual coverage; policy definition for acceptable content.
  • Data cost–quality–speed trade-off dashboards (data platforms, enterprise ML)
    • Tools/Workflows: Portfolio-level levers (budget allocation, sourcing methods) and dataset-level levers (sampling, annotation depth); visualize trade-offs and impacts on downstream model quality.
    • Assumptions/Dependencies: Instrumentation of data pipelines; agreed-upon quality KPIs.
  • Dataset post-mortems and governance processes (all sectors)
    • Tools/Workflows: Structured reviews after incidents or model failures to identify data defects and cascades; risk registers tied to datasets.
    • Assumptions/Dependencies: Psychological safety and culture; incident management discipline.
  • Community-of-use dataset building and introspection (information retrieval, academic consortia)
    • Tools/Workflows: TREC-style evaluation cycles; shared benchmarks with open critiques; community-maintained issue trackers.
    • Assumptions/Dependencies: Funding and coordination; shared evaluation infrastructure.
  • Organizational incentives and KPIs for data excellence (management, product organizations)
    • Tools/Workflows: Incorporate data reliability, fidelity, and documentation completeness into OKRs and performance reviews.
    • Assumptions/Dependencies: Leadership endorsement; measurable and agreed-upon targets.
  • Reducing annotation artifacts through workflow design (NLP, computer vision)
    • Tools/Workflows: A/B test prompt/instruction variants; randomized task orders; quality gates and reviewer rotation; capture annotator metadata to control confounds.
    • Assumptions/Dependencies: Labeling platform configurability; experiment analytics.
  • Training for diversity of opinion in labeled data (personalization, civic tech, education)
    • Tools/Workflows: Collect and retain multiple valid viewpoints; multi-objective training to serve diverse user personas; configurable policy layers for deployment contexts.
    • Assumptions/Dependencies: Ethical review; product requirement clarity on pluralism vs standardization.
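Several of the immediate applications above (disagreement-aware construction, disagreement dashboards, diversity of opinion) hinge on retaining annotator votes instead of collapsing them to one "ground truth". A minimal sketch, assuming votes arrive as simple label strings; the soft-label and normalized-entropy formulation is one illustrative choice, not the paper's prescription:

```python
# Keep a soft label distribution per item and flag high-disagreement items,
# rather than forcing a single majority label.

import math
from collections import Counter

def soft_label(votes: list[str]) -> dict:
    """Convert raw annotator votes into a probability distribution."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def disagreement(votes: list[str]) -> float:
    """Normalized entropy of the votes: 0 = unanimous, 1 = maximally split."""
    dist = soft_label(votes)
    if len(dist) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in dist.values())
    return entropy / math.log(len(dist))
```

Items with high disagreement can then be routed to adjudication, kept as distributions for uncertainty-aware training, or surfaced on a dashboard, instead of being silently averaged away.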

Long-Term Applications

These require further research, scaling, standards development, or broader coordination.

  • Standardized “goodness-of-data” metrics and benchmarks (academia, standards bodies)
    • Tools/Products: Metric suite for validity, reliability, and fidelity; reference benchmarks that score datasets along these dimensions.
    • Assumptions/Dependencies: Community consensus; cross-domain applicability; empirical validation.
  • Dataset certification programs for high-stakes AI (healthcare, finance, public sector)
    • Tools/Products: Third-party audits and certifications for data practices, documentation, and ongoing maintenance; compliance checklists.
    • Assumptions/Dependencies: Regulatory alignment; accrediting bodies; enforceable policies.
  • Living benchmarks with anti-drift governance (NLP, vision, speech)
    • Tools/Products: Benchmarks that evolve with models-in-the-loop; distribution monitoring and guardrails to prevent drift away from task reality.
    • Assumptions/Dependencies: Sustained funding; participation incentives; versioning and change-log standards.
  • Data observability platforms tailored to ML datasets (software, MLOps)
    • Tools/Products: SaaS platforms monitoring label distributions, disagreement rates, leakage risks, fidelity drift, and data lineage; alerting integrated with MLOps.
    • Assumptions/Dependencies: Integration into diverse data stacks; privacy-preserving telemetry.
  • Human–algorithm hybrid annotation systems (all sectors)
    • Tools/Products: Active learning with expert adjudication; low-shot/unsupervised pre-labeling triaged by humans; uncertainty-targeted sampling.
    • Assumptions/Dependencies: Reliable uncertainty estimation; scalable expert workflows; ROI demonstration.
  • Portfolio-level dataset management and optimization (large enterprises)
    • Tools/Products: Decision support for budget allocation across datasets; simulation of cost–quality–speed outcomes; multi-dataset dependency mapping.
    • Assumptions/Dependencies: Enterprise-wide data inventory; executive sponsorship.
  • Bias detection and remediation embedded in data pipelines (healthcare, finance, hiring)
    • Tools/Products: Fairness monitors leveraging disagreement and fidelity analyses; automated bias tests by subgroup; bias-aware re-sampling and data augmentation.
    • Assumptions/Dependencies: Access to ethically obtained demographic proxies; legal compliance; stakeholder oversight.
  • Policy mandates for dataset documentation and reproducibility (government, public institutions)
    • Tools/Products: Policy frameworks requiring dataset design docs, lineage, and reproducibility artifacts for funded AI projects or public services.
    • Assumptions/Dependencies: Legislative process; harmonization with privacy and security laws; enforcement mechanisms.
  • “Dataset engineering” curricula and professional roles (education, industry)
    • Tools/Products: University courses and certifications on reliability, validity, fidelity, maintainability; formal roles (Dataset Engineer, Data Steward).
    • Assumptions/Dependencies: Academic program adoption; job market recognition.
  • Consumer-facing dataset provenance disclosures (daily life, consumer software)
    • Tools/Products: UI labels showing dataset sources, time ranges, and bias safeguards for AI features (e.g., chatbots, recommendations).
    • Assumptions/Dependencies: Standardized disclosure formats; vendor willingness; usability research.
  • External validity measurement frameworks (academia, product evaluation)
    • Tools/Products: Methodologies to measure performance on the underlying real-world task vs benchmark; field trials and post-deployment studies.
    • Assumptions/Dependencies: Access to real-world outcomes; IRB/ethical approvals; longitudinal tracking.
  • Data cascade risk management in high-stakes AI (healthcare, public safety)
    • Tools/Products: Risk registers and scenario planning for upstream data defects propagating downstream; playbooks for detection, triage, and remediation.
    • Assumptions/Dependencies: Cross-functional risk ownership; incident reporting culture; continuous monitoring.
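Several of the long-term applications above (living benchmarks with anti-drift governance, data observability platforms) reduce to monitoring a dataset's distribution over time. A minimal sketch, assuming label strings per example; the total-variation statistic and the 0.1 alert threshold are illustrative assumptions:

```python
# Compare a new batch's label distribution against a reference snapshot and
# raise an alert when they diverge past a threshold.

from collections import Counter

def label_distribution(labels: list[str]) -> dict:
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two label distributions (0..1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_alert(reference: list[str], batch: list[str],
                threshold: float = 0.1) -> bool:
    """True when the new batch has drifted past the threshold."""
    ref_dist = label_distribution(reference)
    batch_dist = label_distribution(batch)
    return total_variation(ref_dist, batch_dist) > threshold
```

In a production observability platform the same comparison would run over many signals (feature distributions, disagreement rates, leakage checks), but the shape of the check is the same: snapshot, compare, alert.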

Glossary

  • A/B Testing: A method to compare two versions of a dataset or model to determine which performs better. Example: "Measurement of AI success today is often metrics-driven, with emphasis on rigorous model measurement and A/B testing."
  • Annotation Artifacts: Bias in datasets where annotations capture idiosyncrasies irrelevant to the task. Example: "Han et al. 2020 define 'annotation artifacts' as a type of dataset bias in which annotations capture workers' idiosyncrasies that are irrelevant to the task itself."
  • Benchmark Datasets: Predefined datasets used to evaluate the performance of AI models. Example: "Benchmark datasets define the entire world within which models exist and operate."
  • Data Cascades: Issues that compound through the stages of data collection, processing, and model training. Example: "...Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI."
  • Data Fidelity: The degree to which a dataset accurately represents the real world. Example: "Goodness-of-fit metrics, such as F1, Accuracy, AUC, do not tell us much about data fidelity..."
  • Data Validation: Mechanisms to rigorously assess data quality to ensure model reliability. Example: "... importance of rigorously managing data quality using mechanisms specific to data validation..."
  • Dataset Drift: When a dataset evolves over time and deviates from its original distribution. Example: "...how can we prevent the dataset from drifting away from some reasonable distribution for the original task?"
  • Goodness-of-Fit Metrics: Statistical measures used to assess how well a model fits a dataset, such as F1, Accuracy, and AUC. Example: "Goodness-of-fit metrics, such as F1, Accuracy, AUC, do not tell us much about data fidelity..."
  • Human-Annotated Data: Data labeled by humans, often used for training and testing in machine learning. Example: "As human-annotated data represents the compass that the entire ML community relies on..."
  • Maintainability: The ease with which a dataset can be kept up-to-date and usable over time. Example: "Maintainability: Maintaining data at scale has similar challenges as maintaining software at scale."
  • Operational Validity: The suitability of data to represent the intended phenomenon accurately. Example: "For datasets to have operational validity we need to know whether they account for potential complexity, subjectivity..."
  • Reliability: The consistency and quality of data, ensuring reproducible and accurate results. Example: "Reliability captures internal aspects of data validity, such as: consistency, replicability, reproducibility of data."
  • Reproducibility: The ability to replicate results using the same dataset and methodology. Example: "Reliability captures internal aspects of data validity, such as: consistency, replicability, reproducibility of data."
  • Weak Spots: Classes of examples difficult or impossible for a model to handle accurately due to their absence from the dataset. Example: "ML models become prone to develop 'weak spots', i.e., classes of examples that are difficult or impossible..."
