Dataset Construction: Scale, Diversity & Quality
- Dataset construction is the process of curating data that balances scale (volume), diversity (breadth), and quality (accuracy) for robust machine learning.
- Recent advances apply multi-objective optimization and decorrelation pipelines to jointly maximize quantitative metrics and downstream performance.
- Empirical studies show that prioritizing quality and diversity over sheer volume leads to improved annotation fidelity and generalization.
The construction of datasets for machine learning is governed by three principal, interdependent axes: scale (volume of data), diversity (breadth and heterogeneity of content or structure), and quality (factual accuracy, annotation fidelity, and alignment with task objectives). Recent advances in large model pre-training and domain-specific applications have both driven and benefited from rigorous approaches to maximizing these traits jointly. This article surveys foundational concepts, algorithmic methodologies, quantitative metrics, recent best-practice pipelines, and the measurement of trade-offs across these axes, drawing on leading work from language, vision, tabular, speech, code, and multimodal domains.
1. Foundations: Definitions and Quantitative Metrics
Scale is classically defined as the number of records, examples, or total tokens/images/audio hours, but also encompasses the coverage of unique entities such as languages (Yu et al., 2022), object categories (Duan et al., 2022), or programming problems (Puri et al., 2021). In multilingual NLP, the scale of a language is typically measured by the number of datasets covering it, with reported medians near 1 and wide resource disparities (Yu et al., 2022).
Diversity captures the variability and coverage of data with respect to latent concepts, domains, input modalities, features, or demographic attributes. Recent formalizations include:
- Diversity coefficient: the expected cosine distance between Fisher information–based Task2Vec embeddings of batch samples, bounded in $[0, 1]$. For language, typical pre-training corpora exhibit coefficients up to $\approx 0.25$ (on a theoretical scale of $0.05$–$0.40$), values that scale sublinearly with latent concept count and exponentially with vocabulary size (Miranda et al., 2023).
- Semantic diversity: for text, Vendi-style metrics compute diversity as the exponentiated entropy of the eigenvalues of a normalized cosine-similarity matrix of embeddings (Li et al., 2024); a minimal sketch follows this list. For tabular data, diversity is measured as rule overlap between partitioned subsets (Tang et al., 26 Dec 2025).
- Class entropy and imbalance: e.g., for object detection, Shannon entropy of class distribution, and imbalance ratio (Duan et al., 2022).
- Domain/task/attribute coverage: e.g., the number of distinct domains, styles, annotation types, or difficulty levels (Liu et al., 2024, Zhang et al., 2 Aug 2025).
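To make the embedding-based metrics concrete, the following is a minimal sketch of a Vendi-style score, assuming only that `embeddings` comes from some sentence or image encoder; the function name and the unit-normalization choice are ours, not code from the cited papers. A set of $n$ near-identical items scores close to 1, while $n$ mutually distinct items score close to $n$.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Exponentiated Shannon entropy of the eigenvalues of the similarity kernel."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit rows
    K = (X @ X.T) / len(X)       # normalized cosine-similarity kernel, trace(K) == 1
    lam = np.linalg.eigvalsh(K)  # real, nonnegative eigenvalues (K is PSD)
    lam = lam[lam > 1e-12]       # drop numerical zeros before taking logs
    return float(np.exp(-np.sum(lam * np.log(lam))))

rng = np.random.default_rng(0)
clustered = np.repeat(rng.normal(size=(2, 64)), 50, axis=0)  # 100 copies of 2 points
spread = rng.normal(size=(100, 64))                          # 100 distinct points
print(vendi_score(clustered) < vendi_score(spread))          # True: spread is more diverse
```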
Quality is the accuracy, reliability, verifiability, and alignment of the content or annotation to intended usage. Key metrics and protocols include:
- Automatic scoring and filtering via LM-based Likert ratings, perplexity-based metrics, or reward models (He et al., 21 Oct 2025, Li et al., 2024).
- Factuality, completeness, consistency checks via multi-dimensional scores (He et al., 21 Oct 2025).
- Human-in-the-loop validation at each stage: inter-annotator agreement scores (e.g., Krippendorff's $\alpha$, Cohen's $\kappa$; a sketch of $\kappa$ follows this list), precision/recall in pruning (Yu et al., 2022, Duan et al., 2022, Liu et al., 2024), and test–retest reliability (Zhao et al., 2024).
- Downstream performance: direct evaluation on diagnostic, zero-shot, and robustness benchmarks (Zhang et al., 2 Aug 2025, He et al., 8 Dec 2025, Liu et al., 2024).
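For reference, Cohen's $\kappa$ for two annotators over the same items reduces to a few lines; this is the standard formula, with toy labels purely for illustration.

```python
import numpy as np

def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    labels = np.union1d(a, b)
    p_o = float(np.mean(a == b))                                                # observed
    p_e = sum(float(np.mean(a == l)) * float(np.mean(b == l)) for l in labels)  # chance
    return (p_o - p_e) / (1.0 - p_e)

annotator_a = np.array([0, 1, 1, 0, 2, 1])
annotator_b = np.array([0, 1, 0, 0, 2, 1])
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.739
```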
2. Algorithmic Pipelines for Joint Scale, Diversity, and Quality
Multi-Dimensional and Orthogonal Selection
Single-score selection for quality often collapses correlated metrics, overfitting to a narrow region of data space and diminishing downstream performance (He et al., 21 Oct 2025). The ODiS (Orthogonal Diversity-Aware Selection) approach operationalizes a decorrelation pipeline:
- Multi-dimensional GPT-based scoring per instance, covering language, knowledge, comprehension, and information axes.
- PCA decorrelation of the score matrix yields a small set of orthogonal principal axes.
- RoBERTa regressors are trained to predict score projections, enabling scalable inference across massive pools.
- A token budget is allocated across axes; top-ranked samples per dimension are selected, minimizing intersection (empirically roughly 2% overlap between dimensions).
- This decomposition addresses the documented non-monotonicity between the highest per-sample scores and generalization (He et al., 21 Oct 2025); a schematic selection sketch follows this list.
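The numpy sketch below illustrates the decorrelate-then-select step only; the equal per-axis budget split and all identifiers are our simplifications, and ODiS additionally trains RoBERTa regressors so the projected scores can be predicted cheaply over massive pools.

```python
import numpy as np

def decorrelated_select(scores: np.ndarray, budget: int) -> np.ndarray:
    """PCA-decorrelate an (n_samples, n_dims) score matrix, then pick top-k per axis."""
    Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # standardize raw axes
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)         # principal directions
    proj = Z @ Vt.T                                          # decorrelated projections
    per_axis = budget // proj.shape[1]                       # naive equal budget split
    chosen: set[int] = set()
    for k in range(proj.shape[1]):
        chosen.update(np.argsort(-proj[:, k])[:per_axis].tolist())  # best on axis k
    return np.sort(np.fromiter(chosen, dtype=int))

rng = np.random.default_rng(0)
scores = rng.normal(size=(10_000, 4))  # e.g., language/knowledge/comprehension/information
print(len(decorrelated_select(scores, budget=400)))  # < 400 only where axis picks overlap
```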
Quality-Diversity Optimization
Quality-diversity (QD) approaches formalize data selection as a multi-objective optimization, trading off facility-location–style coverage with reward (quality) scoring. For instruction tuning, QDIT greedily maximizes a composite objective $\lambda\, d(x \mid S) + (1 - \lambda)\, q(x)$, where $d(x \mid S)$ is the marginal facility-location diversity gain of candidate $x$ given the selected set $S$ and $q(x)$ is its predicted quality (Bukharin et al., 2023). The trade-off parameter $\lambda$ is empirically tuned in $[0, 1]$. Increasing diversity significantly raises worst-case robustness, and moderate volumes (on the order of 10K well-chosen examples) often match or exceed random selection from much larger pools; a greedy sketch follows.
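The loop below is a hedged sketch of that greedy objective with a facility-location coverage term over cosine similarities; the kernel choice, the $\lambda$ default, and all names are our assumptions rather than QDIT's exact implementation.

```python
import numpy as np

def qd_select(emb: np.ndarray, quality: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Greedily maximize lam * facility-location gain + (1 - lam) * quality."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = np.clip(X @ X.T, 0.0, None)        # nonnegative pairwise similarities
    cover = np.zeros(len(X))               # current coverage of each point by the set
    chosen: list[int] = []
    for _ in range(k):
        gain = np.maximum(S - cover[None, :], 0.0).sum(axis=1) / len(X)  # in [0, 1]
        obj = lam * gain + (1.0 - lam) * quality                         # quality in [0, 1]
        obj[chosen] = -np.inf              # forbid re-selection
        i = int(np.argmax(obj))
        chosen.append(i)
        cover = np.maximum(cover, S[i])
    return chosen

rng = np.random.default_rng(0)
picks = qd_select(rng.normal(size=(500, 32)), rng.uniform(size=500), k=25)
print(len(set(picks)))  # 25 distinct examples balancing coverage and quality
```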
In synthetic grasping datasets, QD (MAP-Elites) maintains a grid archive of object-centric grasp behaviors, filling the space with both high-quality and diverse samples via robust simulation-based scoring (Huber et al., 2024).
Partitioning and Conditional Generation for Heterogeneity
For heterogeneous tabular domains, DATE partitions the data into distributionally coherent slices via decision tree–derived "distribution-guiding rules" (DGRs), then generates high-quality LLM-based synthetic data for each slice. A multi-armed bandit algorithm is applied to select a balanced subset that maximizes a convex combination of validation performance (quality) and partition overlap–aware diversity (Tang et al., 26 Dec 2025); a minimal bandit sketch follows.
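As a loose illustration of the bandit step, the UCB1 sketch below treats each partition as an arm and any validation-performance signal as the reward; the reward function here is a synthetic stand-in, not DATE's actual objective.

```python
import math, random

def ucb_select(n_arms: int, rounds: int, reward_fn) -> list[int]:
    """UCB1: pull the arm with the highest mean + exploration bonus each round."""
    counts, means, history = [0] * n_arms, [0.0] * n_arms, []
    for t in range(1, rounds + 1):
        ucb = [means[a] + math.sqrt(2 * math.log(t) / counts[a]) if counts[a] else float("inf")
               for a in range(n_arms)]
        arm = ucb.index(max(ucb))
        r = reward_fn(arm)                            # e.g., validation gain from this slice
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        history.append(arm)
    return history

random.seed(0)
picks = ucb_select(4, rounds=200, reward_fn=lambda a: random.gauss(0.2 + 0.1 * a, 0.05))
print(picks.count(3))  # the highest-reward slice is sampled most often
```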
Scaling Laws and Semantic Diversity in Filtering
ScalingFilter avoids reference-corpus bias by evaluating sample quality via the difference in perplexity between large and small LMs trained on the same data; a larger perplexity drop from the small to the large model signals intrinsically richer, higher-quality content (Li et al., 2024). To ensure diversity is preserved, semantic diversity scores (Vendi) are estimated over text embeddings after filtering. A hedged sketch of the quality signal follows.
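One plausible instantiation of that signal is the small-to-large perplexity ratio. The sketch below uses off-the-shelf distilgpt2 and gpt2 checkpoints purely as stand-ins; the paper's setup trains both model sizes on the same data, which pretrained checkpoints only approximate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return float(torch.exp(loss))

tok = AutoTokenizer.from_pretrained("gpt2")  # shared tokenizer for both sizes
small = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
large = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The mitochondrion is the site of oxidative phosphorylation."
quality = perplexity(small, tok, text) / perplexity(large, tok, text)
print(round(quality, 3))  # larger ratio: the bigger model benefits more from this text
```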
3. Methods for Measuring and Validating Diversity
Strict diversity measurement, as opposed to assertion, is increasingly recognized as essential for scientific dataset documentation (Zhao et al., 2024). Measurement theory prescribes a four-stage workflow:
- Conceptualization: Precise definition of the diversity axis—feature, source, domain, subject, or annotator.
- Operationalization: Transform definitions into concrete indicators—distribution stats, coverage metrics, embedding clusters, or objective formulae.
- Reliability: Quantify inter-annotator agreement, apply test–retest protocols, ensure consistent data collection and labeling.
- Validity: Report convergent validity (correlation with independent diversity measurements/cross-dataset generalization) and discriminant validity (null correlation with unrelated variables).
Concretely, diversity can be quantified via the entropy $H = -\sum_i p_i \log p_i$ of class or domain frequencies, Vendi scores (embedding-space entropy), or the coverage fraction across defined attribute bins; simple reference implementations follow. Specific to synthetic generation, cosine-similarity matrices of expert or model embeddings provide statistical coverage indices (Miranda et al., 2023, Li et al., 2024).
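These distribution-level indicators are a few lines each; the implementations below are generic illustrations (names ours), directly matching the definitions above.

```python
import numpy as np

def class_entropy(counts: np.ndarray) -> float:
    """Shannon entropy H in bits; maximum is log2(K) for K balanced classes."""
    p = counts / counts.sum()
    p = p[p > 0]                               # 0 * log 0 := 0
    return float(-(p * np.log2(p)).sum())

def imbalance_ratio(counts: np.ndarray) -> float:
    return float(counts.max() / counts.min())  # assumes every class is nonempty

def coverage(counts: np.ndarray) -> float:
    return float((counts > 0).mean())          # fraction of attribute bins observed

counts = np.array([500, 300, 120, 60, 20, 0])
print(class_entropy(counts), imbalance_ratio(counts[counts > 0]), coverage(counts))
```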
4. Empirical Benchmarking and the Scale–Diversity–Quality Trade-off
Reported Datasets
Contemporary datasets exemplifying best practice include:
- Pre-training corpora: The Pile and C4, with measured diversity coefficients well above the theoretical lower bound (Miranda et al., 2023).
- Multilingual NLP: coverage of 222 languages, but with only a small minority receiving manual annotation and a median of roughly one dataset per language (Yu et al., 2022).
- Instruction tuning: MMInstruct (973K instructions, 24 domains, manual + synthetic annotation pipeline, SOTA on 10/12 VLLM benchmarks) (Liu et al., 2024); QDIT supports robust selection in such corpora (Bukharin et al., 2023).
- Vision/editing: Fine-T2I (6.3M text–image pairs, 10 task axes, >95% candidates filtered), OpenVE-3M (3M instruction–video pairs, 8 edit types, long instruction tails, stringent VLM-based filtering) (Ma et al., 10 Feb 2026, He et al., 8 Dec 2025).
- Audio: NaijaVoices (1,800 h, 5,455 speakers, high SNR, extensive demographic + dialectal coverage) (Emezue et al., 26 May 2025).
- Robotics: QDGset (62M grasps, 40K objects, 6DOF, 16–20% evaluation reduction via bootstrapping) (Huber et al., 2024).
- Tabular: DATE achieves an average error reduction of 13.7% in classification and 47% in regression using only a few hundred synthetic points per partition (Tang et al., 26 Dec 2025).
Quantitative Trade-Offs
- ODiS achieves minimal intersection between selection axes, improving downstream accuracy by 2.8 points versus random selection (He et al., 21 Oct 2025).
- Quality-only selection clusters near the "easiest" example types; it can keep average performance competitive but yields inferior worst-case robustness compared to QD-driven optimizers (Bukharin et al., 2023).
- Increasing data volume beyond well-chosen subsets (on the order of 50K examples) can deliver diminishing or negative returns in average or worst-case metrics (Bukharin et al., 2023, Zhang et al., 2 Aug 2025).
- Filtering approaches such as ScalingFilter preserve both semantic diversity and downstream performance; Vendi diversity rises with multi-source aggregation (Li et al., 2024).
5. Case Studies: Domain-Specific Approaches and Implications
- Object detection (SODA): scale (19,846 images, 286,201 objects), collection diversity (perspective, weather, phase), class balance (imbalance ratio up to 6.5), quality assurance via multi-stage annotation, reported class entropy, and mAP differences across YOLO variants (Duan et al., 2022).
- Fine-tuned multimodal data (Fine-T2I, OpenVE-3M): multi-axis diversity (task, prompt, style/category), per-instance curation with >95% candidate rejection, multi-stage automatic/manual checks, and cross-dataset A/B validation win rates above 70% (Ma et al., 10 Feb 2026, He et al., 8 Dec 2025).
- Math/QA synthesis (Big-Math, BoostQA): large-scale aggregation (over 250k questions), deduplication (semantic, regex, model solve), and reformulation to open-ended format with unambiguous answer extraction (Albalak et al., 24 Feb 2025, Zhang et al., 2 Aug 2025).
6. Practical Guidelines and Measurement-Driven Best Practices
Pragmatic best practices, consistently endorsed across leading works, include:
- Explicitly define targeted diversity dimensions at project inception (Zhao et al., 2024).
- Integrate multi-axis, decorrelated or QD-based data selection algorithms—avoid uni-modal or single-score selection (He et al., 21 Oct 2025, Bukharin et al., 2023).
- Apply stringent automatic and human-in-the-loop quality filtering with high-coverage, low-bias validation (Liu et al., 2024, Ma et al., 10 Feb 2026, He et al., 8 Dec 2025).
- Regularly monitor dataset scale, diversity (entropy, coefficient, or embedding-space metrics), and downstream validation, iterating as necessary (Miranda et al., 2023, Li et al., 2024).
- Publish transparent documentation (datasheets) encompassing conceptual definitions, operationalization details, reliability and validity metrics, and known limitations (Zhao et al., 2024, Yu et al., 2022).
- Moderate scale according to empirical returns; maximize coverage and diversity per unit annotation/computation rather than pursuing scale alone (Bukharin et al., 2023, Zhang et al., 2 Aug 2025).
This synthesis, grounded in recent research, establishes dataset construction as a measurement-led, optimization-driven process wherein scale, diversity, and quality must be quantitatively specified, algorithmically maintained, and empirically justified to support reliable progress in contemporary machine learning.