Data Excellence for AI: Why Should You Care?
Abstract: The efficacy of ML models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which their empirical progress is measured. Benchmark datasets define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the algorithmic aspect of the models rather than critiquing and improving the data with which our models operate. If "data is the new oil," we are still missing work on the refineries by which the data itself could be optimized for more effective use.
Explain it Like I'm 14
What is this paper about?
This paper argues that in AI, the data we use is just as important as the algorithms (the “smart” parts) we build. Think of AI like a car: the algorithm is the engine, but the data is both the fuel and the map. If the fuel is dirty or the map is wrong, even the best engine won’t get you safely where you want to go. The authors explain why we need “data excellence” and share lessons from a workshop that brought experts together to discuss how to improve the quality of data used in AI.
What questions did the authors ask?
To make the ideas easy to grasp, here are the main questions the paper explores:
- Why is data quality often ignored, and why does that hurt AI systems?
- How can we measure whether a dataset truly represents the real world (not just whether a model scores well on a test)?
- What properties make a dataset “excellent,” and how can we build and maintain such datasets?
- What can AI learn from the way software engineering focuses on quality and safety?
- What best practices, tools, and habits should the AI community adopt to avoid “data disasters”?
How did they study it?
Instead of running one experiment, the authors organized the 1st Data Excellence Workshop at a research conference (HCOMP 2020). They:
- Invited speakers from industry, academia, and government to share case studies and lessons learned about data problems and successes.
- Discussed and compared experiences to find common patterns and build a framework for “data excellence.”
- Reviewed three contributed research papers that examined issues like labeling bias, supporting diverse opinions in training data, and improving data practices in high‑stakes AI (like medicine).
You can think of this workshop like a “team huddle” for the AI community, where people brought examples of what goes wrong with data and how to fix it, then drafted a playbook everyone can use.
What did they find, and why is it important?
The workshop surfaced several important ideas. To keep it clear, here are the highlights explained with everyday examples:
The problem: We care too much about scores and too little about the test itself
- AI success today is often measured with numbers like Accuracy, F1, or AUC. These tell us how well a model fits a dataset (like how well you did on a practice test).
- But these numbers don’t tell us whether the dataset itself is a good reflection of real life. It’s like getting an A on an easy or unrealistic practice test—the score looks great, but you may still fail in the real world.
Real-world data is messy—and that matters
- Many datasets have “dirty” or biased data. If the input is wrong (“garbage in”), the output is wrong (“garbage out”).
- Benchmark datasets (common “practice tests” for AI) often remove disagreement or ambiguity among human labelers. That makes the test cleaner but less realistic, so models can appear better than they actually are and then fail on real, tricky cases.
A framework for data excellence: four pillars
To build better datasets, the authors propose focusing on four key properties:
- Maintainability: Keeping data healthy over time. Like maintaining a garden—data needs regular care, updates, and fixes as the world changes.
- Validity: Measuring the right thing. Like a bathroom scale—it should measure weight, not height. A valid dataset actually captures the real concept you care about.
- Reliability: Consistency and reproducibility. Like a thermometer that gives the same result for the same temperature each time. Others should be able to collect similar data and get similar results.
- Fidelity: How close the dataset is to reality. Like a detailed map that accurately shows roads, traffic, and obstacles. Sampling mistakes (e.g., mixing the same user’s data into both training and testing) can reduce fidelity and mislead models.
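The fidelity pitfalls above (the same user appearing in both splits, or test data that predates the training data) can be checked mechanically. Here is a minimal stdlib-Python sketch, assuming each example is a dict with hypothetical `user_id` and `timestamp` fields; real pipelines would plug such checks into their data-validation stage:

```python
from datetime import datetime

def check_split_fidelity(train, test):
    """Flag two common fidelity problems in a train/test split:
    user leakage (same user in both splits) and temporal leakage
    (test examples that predate the newest training example)."""
    train_users = {row["user_id"] for row in train}
    test_users = {row["user_id"] for row in test}
    leaked_users = train_users & test_users

    latest_train = max(row["timestamp"] for row in train)
    early_test = [row for row in test if row["timestamp"] <= latest_train]

    return {
        "leaked_users": sorted(leaked_users),
        "n_test_before_train_end": len(early_test),
    }

# Toy example: user "u2" appears in both splits, and one test row
# is older than the newest training row.
train = [
    {"user_id": "u1", "timestamp": datetime(2020, 1, 1)},
    {"user_id": "u2", "timestamp": datetime(2020, 2, 1)},
]
test = [
    {"user_id": "u2", "timestamp": datetime(2020, 1, 15)},
    {"user_id": "u3", "timestamp": datetime(2020, 3, 1)},
]
report = check_split_fidelity(train, test)
print(report)  # {'leaked_users': ['u2'], 'n_test_before_train_end': 1}
```

Both flags here are cheap set and comparison operations, so they can run on every new dataset version before any model is trained.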
Lessons from software engineering
- Software matured by investing in code reviews, testing, documentation, and safe release processes. AI needs the same—but for data.
- The community should adopt standards, “unit tests” for datasets (checking both format and meaning of labels), and transparent documentation so others can understand, reuse, and improve datasets.
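To make the "unit tests for datasets" idea concrete, here is a hedged stdlib-Python sketch with one syntactic check (schema and label-vocabulary conformance) and one semantic smoke test (label balance). The field names, label set, and the 0.9 majority threshold are all illustrative assumptions, not prescriptions from the paper:

```python
from collections import Counter

def check_schema(rows, required_fields, allowed_labels):
    """Syntactic check: every row has the required fields and a known label."""
    errors = []
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if f not in row]
        if missing:
            errors.append((i, f"missing fields: {missing}"))
        elif row["label"] not in allowed_labels:
            errors.append((i, f"unknown label: {row['label']!r}"))
    return errors

def check_label_balance(rows, max_majority_share=0.9):
    """Semantic smoke test: no single label dominates the dataset,
    which often signals a collection or sampling defect."""
    counts = Counter(row["label"] for row in rows)
    top_label, top_count = counts.most_common(1)[0]
    share = top_count / len(rows)
    return share <= max_majority_share, top_label, round(share, 2)

rows = [
    {"text": "great", "label": "positive"},
    {"text": "awful", "label": "positive"},
    {"text": "fine", "label": "neutral"},
    {"text": "bad"},                      # defect: missing label field
    {"text": "??", "label": "positve"},   # defect: typo in label
]
# Two defects found: row 3 is missing a field, row 4 has a label typo.
print(check_schema(rows, ["text", "label"], {"positive", "negative", "neutral"}))
```

Like software unit tests, these would run in continuous integration so that a schema change or labeling regression fails the build rather than silently degrading the model.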
Examples from the talks and papers
- Chatbots trained on unlabeled internet text may learn toxic behavior—data quality and safety checks are essential.
- Medical data labeling is complex: it often requires expert graders, careful guidelines, and managers to maintain quality.
- “Annotation artifacts” happen when labelers’ habits unintentionally bias the data (like a person who clicks the same answer too often). Better workflows can reduce these artifacts.
- Training data should support diverse opinions when the “right” answer isn’t single and clear, so AI serves more users fairly.
- Datasets should be treated like engineered infrastructure: plan them, document them, and hold people accountable for quality.
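One cheap proxy for the "annotation artifacts" point above (a labeler who clicks the same answer too often) is the entropy of each annotator's label distribution. The sketch below is an illustrative stdlib-Python heuristic, assuming annotations arrive as hypothetical `(annotator, label)` pairs; production workflows would also account for item difficulty and cross-annotator comparison:

```python
import math
from collections import Counter, defaultdict

def label_entropy(labels):
    """Shannon entropy (in bits) of an annotator's label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def flag_low_entropy_annotators(annotations, min_entropy=0.5):
    """Flag annotators whose labels are suspiciously uniform, a cheap
    proxy for 'clicks the same answer too often'."""
    by_annotator = defaultdict(list)
    for annotator, label in annotations:
        by_annotator[annotator].append(label)
    return {a: round(label_entropy(ls), 2)
            for a, ls in by_annotator.items()
            if label_entropy(ls) < min_entropy}

annotations = [
    ("alice", "yes"), ("alice", "no"), ("alice", "yes"), ("alice", "no"),
    ("bob", "yes"), ("bob", "yes"), ("bob", "yes"), ("bob", "yes"),
]
print(flag_low_entropy_annotators(annotations))  # {'bob': 0.0}
```

A flagged annotator is not necessarily careless (the items they saw may genuinely share one answer), so this is a trigger for review, not an automatic rejection.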
What’s the potential impact?
If the AI community takes data excellence seriously, we can expect:
- Safer, fairer, and more reliable AI systems in high‑stakes areas like health, finance, and public services.
- Faster progress that isn’t just “better scores on benchmarks,” but real‑world improvements.
- Fewer “data disasters,” where models break because the dataset was flawed or outdated.
- A cultural shift: valuing data work (discovery, cleaning, documenting, maintaining) as much as model work, with new metrics that judge how well datasets reflect reality.
In short, this paper is a call to action: treat data with the same care and rigor as we treat code. Build, test, document, and maintain datasets so AI can be trustworthy and truly useful in the real world.
Knowledge Gaps
Below is a concise list of knowledge gaps, limitations, and open questions that the paper leaves unresolved. Each item is concrete and intended to guide future research.
- No standardized, validated metrics for “goodness-of-data” across validity, reliability, fidelity, and maintainability; need operational definitions, measurement protocols, and benchmark tasks to compare datasets.
- Lack of formal data validation test suites analogous to software unit/integration tests (e.g., automated checks for label semantics, sampling bias, leakage, annotator disagreement handling).
- Absence of methods to represent, retain, and leverage annotator disagreement, ambiguity, and multi-valence in both training and evaluation (beyond collapsing to a single “ground truth”).
- Missing specification languages/processes for data requirements that link constructs to sampling frames, labeling schemas, target distributions, quality thresholds, and reproducibility criteria.
- Limited infrastructure for treating datasets as living artifacts (versioning, dependency management, continuous data integration, change logs, and data regression testing).
- No robust monitoring and correction strategies to prevent distribution drift in model-in-the-loop or continuously updated datasets (including alarm thresholds and corrective sampling policies).
- Inadequate quantitative cost–benefit models for investments in data excellence (quality vs speed vs cost), including control levers and ROI estimation under real constraints.
- Unclear accountability, roles, and incentive structures for dataset development and maintenance within organizations and communities; need governance models and role definitions.
- No shared taxonomy and repository of data defects, failure modes, and “data catastrophes,” along with standardized evaluation tasks for data-quality interventions.
- Weak reproducibility standards for data collection pipelines; need replication packages detailing sampling seeds, recruitment strategies, labeling interfaces and instructions, workforce management, and environment configurations.
- Insufficient methods to measure dataset fidelity to real-world phenomena (external validity checks, coverage analyses, temporal/user leakage detection, and provenance auditing).
- Need practical, human-centered methodologies to quantify and improve annotation reliability beyond inter-annotator agreement (accounting for training, fatigue, incentives, and expertise).
- Limited strategies to reduce annotation artifacts while preserving task signal (workflow design, interface constraints, prompt phrasing), plus automated detection and remediation tools.
- Unclear approaches to synergize human judgment with low-shot/unsupervised/active learning during annotation while maintaining quality guarantees and auditability.
- No consensus on minimum viable dataset documentation and bias reporting (what is “enough” under deadlines, templates for bias discussion, maintenance and deprecation plans).
- Missing community processes for “relentless introspection” of datasets (review norms, auditing cycles, acceptance criteria for dataset papers, and community-of-use stewardship).
- Benchmarks rarely incorporate real-world ambiguity and hard cases; need protocols to include disagreement distributions, adversarial/rare events, and to report slice-wise performance and weak spots.
- Lack of task-grounded evaluation protocols that measure model performance against real-world outcomes rather than benchmark proxies; need linkage to end-user impact metrics.
- Few automated tools to detect and prevent data leakage (across time, users, and sources) and enforce proper splits; need provenance tracking and leakage tests integrated in pipelines.
- Underdeveloped practices for collecting and curating toxic or sensitive content (e.g., dialogue) that balance coverage, harm mitigation, annotator well-being, and ethical compliance.
- Domain-specific guidelines and checklists (e.g., medical labeling) are not formalized or validated; need structured roles (experts, workforce managers), training curricula, and effectiveness assessments.
- Limited methods for privacy-preserving data excellence that quantify how privacy mechanisms (e.g., differential privacy) affect validity, fidelity, and downstream utility.
- Scalability of data maintenance is unaddressed (staffing models, automation, KPIs for data health, incident response, and budgeting for long-term upkeep).
- No methodology to disentangle contributions of data improvements versus model changes; need experimental designs and reporting standards to attribute performance gains to data excellence.
- Insufficient practices to ensure dataset reusability across applications (modular schemas, intended-use statements, composability, licensing clarity, and interoperability standards).
- Lack of standardized reporting for dataset pipeline architectures as engineered infrastructure (diagrams, test coverage, monitoring, failure handling, and change management).
- Few comparative evaluations of existing datasets against the proposed properties (maintainability, reliability, validity, fidelity) to empirically validate the data excellence framework.
- Limited mechanisms for proactively discovering “unknown unknowns” via systematic probing, targeted adversarial data collection, coverage metrics, and exploratory sampling.
- Incomplete stakeholder analysis for data fidelity concerns (who cares, how requirements differ across stakeholders, and how to reconcile competing priorities).
- Weak funding and publication incentives for data excellence work; need review criteria, recognition mechanisms, and institutional policies that reward dataset quality and maintenance.
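Several gaps above concern preserving annotator disagreement rather than collapsing it to a single "ground truth." A minimal stdlib-Python sketch of the alternative is a soft-label representation, shown below under the assumption that raw votes are keyed by a hypothetical item ID; training on such distributions requires models that accept soft targets:

```python
from collections import Counter

def soft_labels(annotations):
    """Turn per-item annotator votes into a probability distribution,
    preserving disagreement instead of collapsing to a majority vote."""
    dist = {}
    for item_id, votes in annotations.items():
        counts = Counter(votes)
        total = len(votes)
        dist[item_id] = {label: c / total for label, c in counts.items()}
    return dist

# Item "q2" is genuinely ambiguous: a majority vote would hide the
# 40% of annotators who disagreed.
annotations = {
    "q1": ["toxic", "toxic", "toxic", "toxic", "toxic"],
    "q2": ["toxic", "ok", "toxic", "ok", "toxic"],
}
print(soft_labels(annotations))
# {'q1': {'toxic': 1.0}, 'q2': {'toxic': 0.6, 'ok': 0.4}}
```

Evaluation can then score models against the full distribution (e.g., with cross-entropy) instead of a forced single label, which directly addresses the "disagreement distributions in benchmarks" gap.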
Practical Applications
The following lists synthesize actionable, real-world applications of “Data Excellence for AI” across industry, academia, policy, and daily life. Each item notes the relevant sector(s), suggested tools/products/workflows, and key assumptions or dependencies.
Immediate Applications
These can be piloted or deployed with current practices and tooling.
- Data unit tests for labels and datasets (software, healthcare, education)
- Tools/Workflows: Create “dataset unit tests” analogous to software tests (syntax checks for schema conformity; semantic checks for label meaning/magnitude; cross-split leakage checks).
- Assumptions/Dependencies: Clear label schema, domain guidelines, and CI/CD integration for data.
- Disagreement-aware dataset construction and evaluation (NLP, recommender systems, content moderation)
- Tools/Workflows: Preserve annotator disagreement instead of collapsing to a single “ground truth”; train and evaluate with uncertainty-aware labels; use disagreement dashboards.
- Assumptions/Dependencies: Labeling platforms support multi-annotator storage; models can consume soft labels or distributions.
- Adversarial data collection to uncover model “weak spots” (software, robotics, chatbots)
- Tools/Workflows: Red-teaming datasets; targeted sampling of edge cases; model-in-the-loop data acquisition that explicitly seeks failure modes.
- Assumptions/Dependencies: Budget for ongoing data collection; evaluation harnesses and safety guardrails.
- Data fidelity checks for sampling and split design (healthcare time series, energy forecasting, finance, robotics)
- Tools/Workflows: Enforce temporal splits to avoid leakage; separate user-based data across train/test; audit representativeness by geography, demographics, device, and time.
- Assumptions/Dependencies: Reliable timestamps, user/device identifiers, and access to stratification attributes.
- Operational data documentation (all sectors)
- Tools/Workflows: Lightweight dataset design documents that capture scope, collection process, known biases, maintenance plan, and intended use; versioned documentation as the dataset evolves.
- Assumptions/Dependencies: Team time allocation; governance requirements; privacy reviews.
- Data versioning and lineage for reproducibility (finance, healthcare, regulated industries)
- Tools/Workflows: Data catalogs; dataset “Git” with immutable snapshots, lineage graphs, and reproducible pipelines; tie experiment artifacts to dataset versions.
- Assumptions/Dependencies: Storage/infrastructure; standardized metadata; organizational buy-in.
- Expert labeling operations for high-stakes domains (healthcare, legal, safety-critical manufacturing)
- Tools/Workflows: Workforce managers to coordinate experts; graded guidelines with training; adjudication protocols; audit trails of decision rationale.
- Assumptions/Dependencies: Access to domain experts; secure environments; compensation and training budgets.
- Toxicity filtering and curation for dialogue corpora (chatbots, customer support)
- Tools/Workflows: Pre-filter training corpora for offensive content; continual re-evaluation with “relentless introspection”; user-safety policies.
- Assumptions/Dependencies: High-quality toxicity detectors and multilingual coverage; policy definition for acceptable content.
- Data cost–quality–speed trade-off dashboards (data platforms, enterprise ML)
- Tools/Workflows: Portfolio-level levers (budget allocation, sourcing methods) and dataset-level levers (sampling, annotation depth); visualize trade-offs and impacts on downstream model quality.
- Assumptions/Dependencies: Instrumentation of data pipelines; agreed-upon quality KPIs.
- Dataset post-mortems and governance processes (all sectors)
- Tools/Workflows: Structured reviews after incidents or model failures to identify data defects and cascades; risk registers tied to datasets.
- Assumptions/Dependencies: Psychological safety and culture; incident management discipline.
- Community-of-use dataset building and introspection (information retrieval, academic consortia)
- Tools/Workflows: TREC-style evaluation cycles; shared benchmarks with open critiques; community-maintained issue trackers.
- Assumptions/Dependencies: Funding and coordination; shared evaluation infrastructure.
- Organizational incentives and KPIs for data excellence (management, product organizations)
- Tools/Workflows: Incorporate data reliability, fidelity, and documentation completeness into OKRs and performance reviews.
- Assumptions/Dependencies: Leadership endorsement; measurable and agreed-upon targets.
- Reducing annotation artifacts through workflow design (NLP, computer vision)
- Tools/Workflows: A/B test prompt/instruction variants; randomized task orders; quality gates and reviewer rotation; capture annotator metadata to control confounds.
- Assumptions/Dependencies: Labeling platform configurability; experiment analytics.
- Training for diversity of opinion in labeled data (personalization, civic tech, education)
- Tools/Workflows: Collect and retain multiple valid viewpoints; multi-objective training to serve diverse user personas; configurable policy layers for deployment contexts.
- Assumptions/Dependencies: Ethical review; product requirement clarity on pluralism vs standardization.
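The "data versioning and lineage" application above can start very small: a content-addressed fingerprint that pins an experiment to an exact dataset snapshot. The stdlib-Python sketch below is an illustrative toy (real systems such as dedicated data-version-control tools add storage, diffs, and lineage graphs on top of the same idea):

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Content-addressed snapshot ID for a dataset: any change to any
    row changes the fingerprint, so experiments can be pinned to an
    exact dataset version (a tiny sketch of 'Git for data')."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

v1 = [{"text": "good", "label": "positive"}]
v2 = [{"text": "good", "label": "negative"}]  # one label was corrected

print(dataset_fingerprint(v1) == dataset_fingerprint(v1))  # True
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))  # False
```

Recording this fingerprint alongside each training run makes "which data produced this model?" answerable after the fact, which is the precondition for reproducibility in regulated settings.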
Long-Term Applications
These require further research, scaling, standards development, or broader coordination.
- Standardized “goodness-of-data” metrics and benchmarks (academia, standards bodies)
- Tools/Products: Metric suite for validity, reliability, and fidelity; reference benchmarks that score datasets along these dimensions.
- Assumptions/Dependencies: Community consensus; cross-domain applicability; empirical validation.
- Dataset certification programs for high-stakes AI (healthcare, finance, public sector)
- Tools/Products: Third-party audits and certifications for data practices, documentation, and ongoing maintenance; compliance checklists.
- Assumptions/Dependencies: Regulatory alignment; accrediting bodies; enforceable policies.
- Living benchmarks with anti-drift governance (NLP, vision, speech)
- Tools/Products: Benchmarks that evolve with models-in-the-loop; distribution monitoring and guardrails to prevent drift away from task reality.
- Assumptions/Dependencies: Sustained funding; participation incentives; versioning and change-log standards.
- Data observability platforms tailored to ML datasets (software, MLOps)
- Tools/Products: SaaS platforms monitoring label distributions, disagreement rates, leakage risks, fidelity drift, and data lineage; alerting integrated with MLOps.
- Assumptions/Dependencies: Integration into diverse data stacks; privacy-preserving telemetry.
- Human–algorithm hybrid annotation systems (all sectors)
- Tools/Products: Active learning with expert adjudication; low-shot/unsupervised pre-labeling triaged by humans; uncertainty-targeted sampling.
- Assumptions/Dependencies: Reliable uncertainty estimation; scalable expert workflows; ROI demonstration.
- Portfolio-level dataset management and optimization (large enterprises)
- Tools/Products: Decision support for budget allocation across datasets; simulation of cost–quality–speed outcomes; multi-dataset dependency mapping.
- Assumptions/Dependencies: Enterprise-wide data inventory; executive sponsorship.
- Bias detection and remediation embedded in data pipelines (healthcare, finance, hiring)
- Tools/Products: Fairness monitors leveraging disagreement and fidelity analyses; automated bias tests by subgroup; bias-aware re-sampling and data augmentation.
- Assumptions/Dependencies: Access to ethically obtained demographic proxies; legal compliance; stakeholder oversight.
- Policy mandates for dataset documentation and reproducibility (government, public institutions)
- Tools/Products: Policy frameworks requiring dataset design docs, lineage, and reproducibility artifacts for funded AI projects or public services.
- Assumptions/Dependencies: Legislative process; harmonization with privacy and security laws; enforcement mechanisms.
- “Dataset engineering” curricula and professional roles (education, industry)
- Tools/Products: University courses and certifications on reliability, validity, fidelity, maintainability; formal roles (Dataset Engineer, Data Steward).
- Assumptions/Dependencies: Academic program adoption; job market recognition.
- Consumer-facing dataset provenance disclosures (daily life, consumer software)
- Tools/Products: UI labels showing dataset sources, time ranges, and bias safeguards for AI features (e.g., chatbots, recommendations).
- Assumptions/Dependencies: Standardized disclosure formats; vendor willingness; usability research.
- External validity measurement frameworks (academia, product evaluation)
- Tools/Products: Methodologies to measure performance on the underlying real-world task vs benchmark; field trials and post-deployment studies.
- Assumptions/Dependencies: Access to real-world outcomes; IRB/ethical approvals; longitudinal tracking.
- Data cascade risk management in high-stakes AI (healthcare, public safety)
- Tools/Products: Risk registers and scenario planning for upstream data defects propagating downstream; playbooks for detection, triage, and remediation.
- Assumptions/Dependencies: Cross-functional risk ownership; incident reporting culture; continuous monitoring.
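The anti-drift governance items above need a drift signal to act on. One simple candidate, sketched here in stdlib Python under the assumption that we only compare label distributions (real observability platforms would monitor many more statistics), is the total-variation distance between a reference snapshot and a new batch; the 0.2 alarm threshold is an arbitrary illustrative choice:

```python
from collections import Counter

def total_variation(ref_labels, new_labels):
    """Total-variation distance between two label distributions:
    0.0 means identical, 1.0 means disjoint. A simple drift signal."""
    ref, new = Counter(ref_labels), Counter(new_labels)
    labels = set(ref) | set(new)
    n_ref, n_new = len(ref_labels), len(new_labels)
    return 0.5 * sum(abs(ref[lab] / n_ref - new[lab] / n_new)
                     for lab in labels)

def drift_alarm(ref_labels, new_labels, threshold=0.2):
    """Raise a flag when a new data batch drifts past the threshold."""
    tv = total_variation(ref_labels, new_labels)
    return tv > threshold, round(tv, 2)

reference = ["cat"] * 50 + ["dog"] * 50
batch = ["cat"] * 80 + ["dog"] * 20   # the collection process changed

print(drift_alarm(reference, batch))  # (True, 0.3)
```

Wired into a data pipeline, such an alarm would trigger the corrective sampling policies the living-benchmark proposals call for, rather than letting the dataset silently drift away from the task it was meant to represent.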
Glossary
- A/B Testing: A method to compare two versions of a dataset or model to determine which performs better. Example: "Measurement of AI success today is often metrics-driven, with emphasis on rigorous model measurement and A/B testing."
- Annotation Artifacts: Bias in datasets where annotations capture idiosyncrasies irrelevant to the task. Example: "Han et al. 2020 define 'annotation artifacts' as a type of dataset bias in which annotations capture workers' idiosyncrasies that are irrelevant to the task itself."
- Benchmark Datasets: Predefined datasets used to evaluate the performance of AI models. Example: "Benchmark datasets define the entire world within which models exist and operate."
- Data Cascades: Issues that compound through the stages of data collection, processing, and model training. Example: "...Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI."
- Data Fidelity: The degree to which a dataset accurately represents the real world. Example: "Goodness-of-fit metrics, such as F1, Accuracy, AUC, do not tell us much about data fidelity..."
- Data Validation: Mechanisms to rigorously assess data quality to ensure model reliability. Example: "... importance of rigorously managing data quality using mechanisms specific to data validation..."
- Dataset Drift: When a dataset evolves over time and deviates from its original distribution. Example: "...how can we prevent the dataset from drifting away from some reasonable distribution for the original task?"
- Goodness-of-Fit Metrics: Statistical measures used to assess how well a model fits a dataset, such as F1, Accuracy, and AUC. Example: "Goodness-of-fit metrics, such as F1, Accuracy, AUC, do not tell us much about data fidelity..."
- Human-Annotated Data: Data labeled by humans, often used for training and testing in machine learning. Example: "As human-annotated data represents the compass that the entire ML community relies on..."
- Maintainability: The ease with which a dataset can be kept up-to-date and usable over time. Example: "Maintainability: Maintaining data at scale has similar challenges as maintaining software at scale."
- Operational Validity: The suitability of data to represent the intended phenomenon accurately. Example: "For datasets to have operational validity we need to know whether they account for potential complexity, subjectivity..."
- Reproducibility: The ability to replicate results using the same dataset and methodology. Example: "Reliability captures internal aspects of data validity, such as: consistency, replicability, reproducibility of data."
- Reliability: The consistency and quality of data, ensuring reproducible and accurate results. Example: "Reliability captures internal aspects of data validity, such as: consistency, replicability, reproducibility of data."
- Weak Spots: Classes of examples difficult or impossible for a model to evaluate accurately due to their absence from the dataset. Example: "ML models become prone to develop 'weak spots', i.e., classes of examples that are difficult or impossible..."