Automated Dataset Construction (ADC)
- Automated Dataset Construction (ADC) is a set of techniques that automatically gathers, annotates, and curates diverse machine learning datasets with minimal human input.
- It employs multi-view filtering, generative synthesis, and noise reduction methods to improve annotation quality and dataset diversity across various modalities.
- ADC accelerates research by reducing manual labeling efforts through high-throughput data collection and iterative refinement pipelines.
Automated Dataset Construction (ADC) encompasses a suite of methodologies and frameworks dedicated to the fully or near-fully automatic assembly of machine learning datasets across visual, audio, multimodal, and scientific domains. By leveraging algorithmic techniques—including web-scale retrieval, multi-view filtering, generative data synthesis, robust noise filtering, and multi-agent orchestration—ADC obviates or vastly reduces the need for costly, manual annotation cycles. Contemporary research demonstrates that ADC not only accelerates dataset availability but also injects greater diversity, domain coverage, and annotation robustness when compared to classical, hand-labeled dataset construction paradigms.
1. Principles and Formal Objectives
Modern ADC is grounded in several key technical requirements: (1) high-throughput collection of candidate samples from disparate real or synthetic sources; (2) automated and scalable annotation—often noisy—via class-conditional queries, pre-trained models, or synthetic generation; (3) multi-stage curation and noise-removal leveraging data-centric and learning-centric tools; and (4) reproducibility and auditability, typically enforced through provenance, versioning, and modular pipeline design.
The objective function for ADC may target not just coverage of the input domain, but also maximization of intra-class diversity, minimization of semantic and label noise, and explicit optimization for downstream model robustness. For example, the Adversarial Data Collection paradigm in robotics formalizes an "information density" ρ of an episode E and seeks to maximize it via an adversarial perturbation policy πₚ,

$$\pi_p^{*} = \arg\max_{\pi_p} \; \mathbb{E}_{E \sim \pi_p}\big[\rho(E)\big],$$

where ρ(E) quantifies the set of functionally unique demonstration units in E (Huang et al., 14 Mar 2025). In noisy-label regimes, ADC solutions often estimate or learn an instance-dependent noise transition matrix T(x), as in

$$T_{ij}(x) = P\big(\tilde{y} = j \mid y = i,\, x\big),$$

to support loss correction or label refinement at scale (Liu et al., 2024).
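As a concrete illustration of how a transition matrix supports loss correction, the following is a minimal numpy sketch of forward correction with a class-conditional (instance-independent, for simplicity) matrix T; the toy values are illustrative, not from any cited paper.

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, T):
    """Forward loss correction: push the model's clean-class posteriors through
    the noise transition matrix T (T[i, j] = P(noisy = j | clean = i)), then
    take the negative log-likelihood against the observed noisy labels."""
    noisy_probs = probs @ T                       # P(noisy label | x)
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.log(np.clip(picked, 1e-12, None)).mean()

# Toy setup: 3 classes under 20% symmetric label noise.
T = np.full((3, 3), 0.1) + np.eye(3) * 0.7        # diag 0.8, off-diag 0.1
probs = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10]])
loss = forward_corrected_nll(probs, np.array([0, 1]), T)
```

Minimizing this corrected loss on noisy labels is (in expectation) equivalent to minimizing the uncorrected loss on clean labels, which is what makes the estimated T usable at scale.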
2. Representational Workflows
The high-level ADC workflow typically involves the following main stages:
A. Query-Driven or Class-Attribute Design:
LLMs or domain-specific heuristics are used to enumerate class names, attribute-value pairs, or expansions (e.g., “white cotton shirt,” “police dog”) for maximal diversity in downstream sampling (Yao et al., 2017, Liu et al., 2024).
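A minimal sketch of this stage, assuming hand-written attribute lists in place of LLM-generated ones (a real pipeline would prompt an LLM or query a domain lexicon for the attribute values):

```python
from itertools import product

def expand_queries(class_names, attributes):
    """Cross attribute values with class names to diversify retrieval queries,
    e.g. {"color": ["white"]} x "shirt" -> "white shirt". The attribute dict
    here is a hand-written stand-in for LLM- or heuristic-generated values."""
    queries = []
    for cls in class_names:
        queries.append(cls)  # keep the bare class name as a baseline query
        for combo in product(*attributes.values()):
            queries.append(" ".join(combo) + " " + cls)
    return queries

attrs = {"color": ["white", "blue"], "material": ["cotton", "denim"]}
qs = expand_queries(["shirt"], attrs)
```

The cross-product grows combinatorially, which is exactly what gives downstream sampling its intra-class diversity; in practice a pruning step (Section 3) trims expansions that drift semantically.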
B. Automated Sample Collection:
Web-scale image/video/audio/text retrieval is orchestrated via code generation, search engine APIs, or web-crawlers, emphasizing coverage and intra-class variability (Yao et al., 2017, Liu et al., 2024, Sun et al., 11 Jul 2025).
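The orchestration layer can be sketched independently of any particular search backend; the fetcher below is a deterministic stub standing in for a search-engine API or crawler (real systems would additionally rate-limit, retry, and record provenance per hit):

```python
def collect_samples(queries, fetcher, per_query=50):
    """Fan queries out to a pluggable `fetcher` and deduplicate results by URL.
    `fetcher` is an assumed interface: (query, n) -> iterable of URLs."""
    seen, samples = set(), []
    for q in queries:
        for url in fetcher(q, per_query):
            if url not in seen:
                seen.add(url)
                # The retrieval query doubles as a (noisy) label, per stage C.
                samples.append({"url": url, "query": q})
    return samples

# Toy fetcher returning deterministic fake URLs that overlap across queries.
fake_fetcher = lambda q, n: [f"https://example.com/{q.split()[0]}/{i}" for i in range(n)]
batch = collect_samples(["dog park", "dog beach"], fake_fetcher, per_query=3)
```

Deduplication at collection time matters because overlapping query expansions retrieve many of the same pages, and duplicates would silently inflate apparent class coverage.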
C. Noisy Label Assignment:
Downloaded samples are annotated using their retrieval query, pre-trained classifiers, or multi-modal model outputs—a process prone to “label noise” which is subsequently managed in curation (Liu et al., 2024).
D. Filtering and Curation:
- Data-centric modules (e.g., SVM saliency, MIL, outlier detectors) and learning-centric routines (early-loss filtering, Mixup, co-teaching) are employed for denoising and instance selection. Multi-view and multi-instance learning are common, e.g., the co-regularized objective

$$\min_{f^{(1)},\, f^{(2)}} \; \sum_{v=1}^{2} \sum_{i} \ell\big(f^{(v)}(x_i^{(v)}),\, y_i\big) \;+\; \lambda \sum_{i} \big(f^{(1)}(x_i^{(1)}) - f^{(2)}(x_i^{(2)})\big)^2,$$

which penalizes disagreement between the semantic and visual views, as used for multi-view textual query pruning (Yao et al., 2017).
E. Automated Annotation and Post-processing:
Vision–language and segmentation foundation models (e.g., Grounding DINO, SAM 2), together with automated segmentation, cropping, resizing, normalization, and augmentation, are standard (Sun et al., 11 Jul 2025).
F. Dataset Output and Iterative Improvement:
Datasets are published with clean test/validation splits (often human-audited), richly structured metadata, and (in many frameworks) versioned provenance graphs or commit DAGs for reproducibility and collaboration (Kharitonov et al., 2023, Graziani et al., 27 Jan 2025).
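The stages above compose naturally into a modular pipeline; the sketch below chains placeholder stage functions and hashes each intermediate state so the provenance log doubles as a lightweight version record (the stage bodies are illustrative stand-ins, not any cited system's implementation):

```python
import hashlib
import json

def run_pipeline(seed_classes, stages):
    """Chain ADC stages (design -> collect -> filter -> ...), recording a
    content digest and size after each stage for reproducibility/auditing."""
    state, provenance = seed_classes, []
    for name, stage in stages:
        state = stage(state)
        blob = json.dumps(state, sort_keys=True, default=str).encode()
        digest = hashlib.sha256(blob).hexdigest()[:12]
        provenance.append({"stage": name, "digest": digest, "size": len(state)})
    return state, provenance

stages = [
    ("design",  lambda cls: [f"photo of a {c}" for c in cls]),
    ("collect", lambda qs: [{"query": q, "id": i} for q in qs for i in range(2)]),
    ("filter",  lambda xs: [x for x in xs if x["id"] == 0]),  # stand-in for denoising
]
dataset, log = run_pipeline(["cat", "dog"], stages)
```

Hashing each intermediate is the minimal version of the provenance graphs and commit DAGs used by frameworks such as Dataset Factory: identical inputs and stage code reproduce identical digests, so divergence is immediately visible.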
3. Algorithmic Innovations and Core Models
ADC pipelines implement a heterogeneous suite of algorithms tailored to their target modality and goals:
- Multi-view Learning for Query Filtering: Casts candidate query expansion pruning as a co-regularized multi-view learning problem, e.g., combining semantic distance (Normalized Google Distance) with visual SVM scores (Yao et al., 2017).
- Multi-instance Learning (MIL): Data points are treated as “bags of instances” (e.g., query expansion → image bag), and inter/intra-bag filtering separates group or individual noise. MIL is operationalized via prototype embeddings and sparse 1-norm SVM LPs (Yao et al., 2017).
- Synthetic Data Generation: Generative pipelines can synthesize new data (images, 3D, audio) via combinations of object detection, blending, and matting (for example, center–outline prediction + Poisson blending) (Lin et al., 2022). For 3D robotics, affordance-based optimization delivers viable grasps via geometric losses and force-closure (Wang et al., 12 Nov 2025).
- Multi-agent Systems: Orchestrated LLM/VLM agents manage demand parsing, collection, annotation, error handling, and batch optimization—enabling robust, parallelized expansion or de novo construction (see DatasetAgent) (Sun et al., 11 Jul 2025).
- Data-centric Label Correction: Outlier detection with embeddings, Isolation Forests, and robust re-labeling or reweighting guards against annotation drift and label noise (Liu et al., 2021).
- Curation for Imbalanced and Noisy Data: Techniques include robust loss objectives (Forward/Backward correction, LDAM, Focal loss), instance reweighting, and structured subsampling for long-tailed distributions (Liu et al., 2024).
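Of the robust objectives named above, focal loss is the simplest to state; a minimal numpy sketch (toy probabilities, not from any cited experiment):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Focal loss: down-weight well-classified examples by (1 - p_t)^gamma so
    the objective concentrates on hard (often minority-class) samples."""
    p_t = probs[np.arange(len(labels)), labels]
    p_t = np.clip(p_t, 1e-12, 1.0)
    return (-((1.0 - p_t) ** gamma) * np.log(p_t)).mean()

probs = np.array([[0.95, 0.05],   # easy, confidently correct example
                  [0.30, 0.70]])  # hard, misclassified example
labels = np.array([0, 0])
fl = focal_loss(probs, labels)
plain_ce = -np.log(probs[[0, 1], [0, 0]]).mean()
```

Relative to plain cross-entropy, the easy example's contribution is scaled by (1 − 0.95)² = 0.0025, so nearly all of the gradient signal comes from the hard example — the property that makes focal loss useful for long-tailed ADC outputs.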
4. Domains and Modalities
ADC is effective across classic and emerging data types:
| Modality | Representative ADC Frameworks | Key Techniques |
|---|---|---|
| Web Images | WSID-100 (Yao et al., 2017), DatasetAgent (Sun et al., 11 Jul 2025) | Multi-view/MIL, LLM demand modeling |
| Satellite Imagery | SentinelDataDownloaderTool (Sebastianelli et al., 2020) | Cloud-masked patch tiling |
| Audio–Text | Auto-ACD (Sun et al., 2023) | Multimodal cue fusion, LLM captions |
| Robotic Manipulation (3D) | ScaleADFG (Wang et al., 12 Nov 2025), ADC-Robotics (Huang et al., 14 Mar 2025) | Affordance optimization, adversarial perturbation |
| General Scientific Data | Analysis Automation at Scale (Graziani et al., 27 Jan 2025) | Multi-agent ingestion, summarization |
| Generative Datasets | Dataset Factory (Kharitonov et al., 2023) | Metadata-driven selection, versioning |
| Data-centric image processing | AutoDC (Liu et al., 2021) | Outlier correction, augmentation |
5. Evaluation Protocols and Empirical Outcomes
ADC system evaluation is multidimensional:
- Classification/Detection Accuracy: Consistent accuracy gains on clean test sets; e.g., an AlexNet trained on the ADC-generated WSID-100 achieves 53.9% when transferred to VOC 2007, versus 19.0% for CIFAR-10 pretraining (Yao et al., 2017). DatasetAgent expansion improves YOLOv8 mAP@0.5 from 76.3% to 81.0% on VOC2007 (Sun et al., 11 Jul 2025).
- Noise Detection: Advanced detectors such as Simi-Feat achieve F₁-scores up to 0.5721 on large web-collected image sets (Liu et al., 2024).
- Data quality and utility metrics: Metrics including class balance index (CBI), structural similarity (SSIM), annotation reliability (ALR), diversity (SDI), and distribution consistency (DDC) are reported for comprehensive quality assurance (Sun et al., 11 Jul 2025).
- Labor Reduction: Human effort is typically reduced by 80–90% for pipeline curation on real datasets, with minimal expert input focused only on outlier or edge-case review (Liu et al., 2021, Sun et al., 11 Jul 2025).
- Synthetic Data Fidelity: Overlap in distributional statistics between synthetic and real samples (e.g., histogram area overlap >95%) validates ADC in generative setups (Graziani et al., 27 Jan 2025).
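The histogram area-overlap statistic used for synthetic-data fidelity is straightforward to compute; a numpy sketch over a shared binning (Gaussian toy data, not the cited experiments):

```python
import numpy as np

def histogram_overlap(real, synthetic, bins=20):
    """Area overlap of two normalized histograms over a shared binning:
    1.0 means identical empirical distributions, 0.0 means disjoint support."""
    lo = min(real.min(), synthetic.min())
    hi = max(real.max(), synthetic.max())
    h_real, _ = np.histogram(real, bins=bins, range=(lo, hi))
    h_syn, _ = np.histogram(synthetic, bins=bins, range=(lo, hi))
    h_real = h_real / h_real.sum()
    h_syn = h_syn / h_syn.sum()
    return np.minimum(h_real, h_syn).sum()   # overlapping area of the two pdfs

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)
close = rng.normal(0.0, 1.0, 10_000)   # same distribution -> overlap near 1
far = rng.normal(4.0, 1.0, 10_000)     # shifted distribution -> overlap near 0
```

Using a shared `range` for both histograms is essential: binning each sample independently would misalign the bins and make the min-sum meaningless.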
6. Challenges, Limitations, and Prospects
ADC systems face characteristic challenges:
- Label Noise and Imbalance: Web-retrieved data are inherently noisy; integrated data-centric (CleanLab, Docta, Simi-Feat) and learning-centric (co-teaching, robust losses) routines are required for achieving high fidelity (Liu et al., 2024).
- Domain Generalization and Transfer: Biases in query design, synthetic artifact introduction (illumination, scale), and limited domain coverage can constrain model performance; robust curation, diversification, and adversarial/affordance-driven perturbations are effective mitigations (Huang et al., 14 Mar 2025, Wang et al., 12 Nov 2025).
- Automation Limitations: Certain domains still require manual supervision for seed template creation (e.g., 3D affordance labels), or for final validation of test sets (Wang et al., 12 Nov 2025, Liu et al., 2024).
- Infrastructure Requirements: Large-scale ADC architectures (e.g., Dataset Factory) demand object storage integration, columnar/ANN indexing, and scalable metadata processing to operate on petabyte-scale archives (Kharitonov et al., 2023).
Potential research vectors include extension to new modalities (audio, video, medical/scientific imaging), progressive augmentation with synthetic or GAN-based virtual samples, and unified joint optimization of data and model parameters (“AutoDC + AutoML” integration) (Liu et al., 2024, Liu et al., 2021).
7. References to Key Systems and Benchmarks
Representative frameworks and datasets in ADC include:
- WSID-100: Web-scale, multi-query, multi-instance benchmark for open-domain image classification and detection (Yao et al., 2017).
- DatasetAgent: Multi-agent, MLLM-driven construction for classification, detection, and segmentation (Sun et al., 11 Jul 2025).
- ScaleADFG: Fully automated affordance-based 3D grasping dataset pipeline, supporting dexterous robotic generalization (Wang et al., 12 Nov 2025).
- Auto-ACD: Audio–language multimodal pipeline achieving state-of-the-art audio-captioning and retrieval (Sun et al., 2023).
- Dataset Factory: Scalable toolchain for metadata-centric selection, curation, and versioning for petabyte-level generative datasets (Kharitonov et al., 2023).
- AutoDC: Modular data-centric pipeline with automated label correction, edge-case enhancement, and augmentation (Liu et al., 2021).
- SentinelDataDownloaderTool: End-to-end EO/satellite imagery dataset builder with reproducible GUI and scripting (Sebastianelli et al., 2020).
These advances confirm that rigorous automation of dataset construction is feasible at scale—and, in many cases, yields datasets with superior diversity, robustness, and transfer performance relative to traditional, fully human-annotated pipelines.