AutoDCWorkflow: Automated Data Processes

Updated 20 March 2026

AutoDCWorkflow is a suite of modular pipelines that automate complex data-centric tasks such as cleaning, forensic preprocessing, and ML dataset enhancement.
It leverages LLM-driven planning, formal control logic, and systematic benchmarking to improve accuracy, throughput, and reproducibility.
The framework incorporates adaptive error recovery and rollback mechanisms, addressing domain-specific challenges in scalable, reliable automation.

AutoDCWorkflow encompasses a family of automated, modular workflows aimed at data-centric tasks, including data cleaning, digital evidence preparation, and high-throughput scientific characterization. Across applications, these workflows systematize the orchestration of multi-stage, error-prone processes through formalized control logic, state representation, operation scheduling, and evaluative benchmarking. The term spans LLM-driven data-cleaning planners, digital forensics preprocessing systems, and large-scale defect modeling pipelines, each leveraging automation to address scalability, reliability, and reproducibility within specialized domains.

1. Automated Data Cleaning Workflows

In the context of data cleaning, AutoDCWorkflow refers to the LLM-led automatic generation of data cleaning pipelines that resolve quality issues in tabular datasets, as formalized in "AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark" (Li et al., 2024). The system takes a "dirty" table $T_0$ and a data-analysis purpose $P$ as input, and outputs a minimal cleaned table $T_n$ along with an explicit, reproducible workflow $W_n$.

The workflow follows a three-component pipeline:

Target Column Selection: An LLM identifies which columns are relevant to $P$ using table descriptions, column samples, and the purpose. This focuses downstream cleaning on the minimal necessary feature space.
Column Quality Inspection: For each selected column, another LLM module produces a Data Quality Report, assessing accuracy, relevance, completeness, and conciseness, outputting pass/fail for each.
Operation and Argument Generation: Based on the report, the LLM predicts the next cleaning operation and its arguments, choosing among a pool of atomic OpenRefine-compatible operations including trim, upper, numeric, date, mass_edit, and regexr_transform. The workflow alternates between LLM-driven planning and deterministic execution.

The formal abstraction is

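$T_0 \xrightarrow{O_1} T_1 \xrightarrow{O_2} \cdots \xrightarrow{O_n} T_n$

where each operation $O_i$ is generated by LLM reasoning conditioned on the preceding table state $T_{i-1}$ and the purpose $P$, and the workflow $W_n$ is the resulting sequence $(O_1, \ldots, O_n)$.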
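The alternation between planning and deterministic execution can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `rule_based_planner` is a hypothetical stand-in for the LLM planner, only two of the named operations (trim and upper) are implemented, and all function and variable names are invented for the example.

```python
from typing import Callable, Optional

Table = list[dict[str, str]]
Operation = tuple[str, str]  # (operation name, target column)

# Deterministic executors for two of the atomic operations named above.
EXECUTORS: dict[str, Callable[[str], str]] = {
    "trim": str.strip,
    "upper": str.upper,
}


def apply_operation(table: Table, op: Operation) -> Table:
    """Apply one operation to every row and return a new table state."""
    name, column = op
    fn = EXECUTORS[name]
    return [{**row, column: fn(row[column])} for row in table]


def rule_based_planner(table: Table, purpose: str) -> Optional[Operation]:
    """Hypothetical stand-in for the LLM planner (it ignores `purpose`).

    Returns the next operation, or None when no issue is found.
    """
    for row in table:
        for column, value in row.items():
            if value != value.strip():
                return ("trim", column)
            if value != value.upper():
                return ("upper", column)
    return None


def run_workflow(table: Table, purpose: str, planner, max_steps: int = 10):
    """Alternate planning and execution until the planner stops."""
    workflow: list[Operation] = []
    for _ in range(max_steps):
        op = planner(table, purpose)
        if op is None:
            break
        table = apply_operation(table, op)
        workflow.append(op)
    return table, workflow


dirty = [{"city": " chicago "}, {"city": "Evanston"}]
cleaned, workflow = run_workflow(
    dirty, "count inspections per city", rule_based_planner
)
```

The returned `workflow` list plays the role of the explicit, replayable record of operations: applying it in order to the original table reproduces the cleaned table.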

Evaluation relies on a benchmark of annotated datasets (NYPL menus, Chicago Food Inspections, PPP loans, etc.), with columns artificially perturbed to inject realistic data pathologies. Metrics include purpose answer F1, the average match ratio between LLM-cleaned and gold tables, and workflow operation F1. In experiments, Llama 3.1 (8B) achieved the highest purpose answer F1 (0.61) and average answer similarity (0.74), exceeding both the baseline and competing LLMs. Operation-level F1 scores ranged from 0.625 to 0.684 across models.
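To make the operation-level metric concrete, the following sketch computes an F1 score between a predicted and a gold workflow. It assumes a set-based comparison of (operation, column) pairs; the benchmark's exact matching rule may differ, and the example workflows are invented.

```python
def operation_f1(predicted, gold):
    """Set-based F1 between predicted and gold (operation, column) pairs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Invented workflows: two of three operations agree.
predicted = [("trim", "city"), ("upper", "city"), ("numeric", "zip")]
gold = [("trim", "city"), ("upper", "city"), ("date", "inspection_date")]

score = operation_f1(predicted, gold)  # precision = recall = 2/3
```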

2. Workflow Automation in Digital Forensics

In "Increasing Digital Investigator Availability through Efficient Workflow Management and Automation" (Braekt et al., 2017), AutoDCWorkflow denotes an orchestrated automation framework for digital evidence preprocessing. The architecture comprises a client-server system with queue servers polling for jobs, invoking tool-specific wrappers, and enforcing job status transitions (queued → processing → succeeded/failed/locked).

The evidence acquisition process is semi-automated: users run an image creation utility to select devices and enqueue tasks. Server-side, each queue server atomically assigns jobs, checks for contention via lock files, and spawns forensic tools via language-agnostic wrappers (e.g., for Bulk Extractor, IEF). Task scheduling is configurable (FIFO), horizontally scalable, and supports automated error recovery.
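The queue-server behaviour described above (FIFO assignment, lock-file contention checks, and status transitions) can be sketched as follows. This is an illustrative single-process model, not the original system: the function names, the `.lock` naming scheme, and `fake_wrapper` are assumptions made for the example.

```python
import os
import tempfile
from collections import deque
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    LOCKED = "locked"


def try_lock(lock_path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def process_queue(jobs, lock_dir, run_tool):
    """Drain a FIFO queue of job ids and return the final status of each."""
    statuses = {job: Status.QUEUED for job in jobs}
    while jobs:
        job = jobs.popleft()
        lock_path = os.path.join(lock_dir, f"{job}.lock")
        if not try_lock(lock_path):
            statuses[job] = Status.LOCKED  # another server holds this job
            continue
        statuses[job] = Status.PROCESSING
        try:
            run_tool(job)  # tool-specific wrapper
            statuses[job] = Status.SUCCEEDED
        except Exception:
            statuses[job] = Status.FAILED
        finally:
            os.remove(lock_path)
    return statuses


def fake_wrapper(job):
    """Hypothetical wrapper that fails for one job to show the FAILED path."""
    if job == "image_c":
        raise RuntimeError("tool exited with an error")


with tempfile.TemporaryDirectory() as lock_dir:
    # Simulate contention: another server already locked image_b.
    open(os.path.join(lock_dir, "image_b.lock"), "w").close()
    result = process_queue(
        deque(["image_a", "image_b", "image_c"]), lock_dir, fake_wrapper
    )
```

Creating the lock with `O_CREAT | O_EXCL` makes the existence check and the creation a single atomic step, which is what prevents two servers from claiming the same job.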

Performance is characterized by formulas such as

$\text{Throughput}~\lambda = \frac{N_\text{processed}}{\Delta T}, \quad \text{Speedup}~S = \frac{T_\text{serial}}{T_\text{AutoDC}}$
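As a worked example of the two formulas, the snippet below evaluates them for made-up numbers. The figures are purely illustrative and are not taken from the case study.

```python
def throughput(n_processed, elapsed_hours):
    """Jobs completed per hour over the observation window."""
    return n_processed / elapsed_hours


def speedup(t_serial_hours, t_automated_hours):
    """Ratio of serial processing time to automated processing time."""
    return t_serial_hours / t_automated_hours


# Illustrative values only: 12 evidence images processed in 48 hours,
# versus 80 hours when handled one at a time by an investigator.
jobs_per_hour = throughput(12, 48.0)  # 12 / 48 = 0.25
gain = speedup(80.0, 50.0)            # 80 / 50 = 1.6
```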

In a real-world human-trafficking case study, end-to-end throughput improved by 40% and first-result time by 65%, with server CPU utilization exceeding 85% during off-hours. Resource footprint is minimized through centralized licensing and 24/7 job execution.

3. Automated Data-Centric Processing for ML Datasets

"AutoDC: Automated data-centric processing" (Liu et al., 2021) outlines an AutoDCWorkflow for image classification dataset improvement. The workflow decomposes into:

Embedding generation (ResNet50-based, softmax layer removed, fixed-length feature vectors)
Outlier detection (per-class Isolation Forests, anomaly scoring)
Human-in-the-loop label correction (flag and correct outliers)
Edge-case selection (hard images identified via anomaly ranking)
Data augmentation (Gaussian noise, cropping, flipping, rotation, etc., with default parameters)

The flow is strictly DAG-structured: Embedding → Outlier Detection → (Label Correction | Edge-case Selection → Augmentation) → Retrain. The system achieves ~80% reduction in manual curation time and up to 15% absolute accuracy improvement on fixed-code models across diverse datasets.
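The stage ordering can be expressed as an explicit dependency graph and checked with a topological sort. The sketch below uses Python's standard `graphlib` module. The stage names are invented identifiers, and it assumes that retraining waits for both the label-correction branch and the augmentation branch.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
DEPENDENCIES = {
    "embedding": set(),
    "outlier_detection": {"embedding"},
    "label_correction": {"outlier_detection"},
    "edge_case_selection": {"outlier_detection"},
    "augmentation": {"edge_case_selection"},
    "retrain": {"label_correction", "augmentation"},
}

# static_order() raises graphlib.CycleError if the graph is not a DAG.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
```

The two middle branches have no mutual dependency, so their relative order in `order` is not fixed and they could run in parallel.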

4. High-Throughput Scientific Characterization: Magneto-Optical Properties

In "ADAQ: Automatic workflows for magneto-optical properties of point defects in semiconductors" (Davidsson et al., 2020), AutoDCWorkflow covers an end-to-end pipeline built atop the High-Throughput Toolkit and VASP for DFT calculations of point defects. Major stages include:

Host unit-cell relaxation (PBE functional, energy-force convergence criteria)
Supercell generation (target size, symmetry preserving)
Defect cluster construction (vacancies, substitutions, interstitials, geometric/deduplication constraints)
Ground-state workflows (multi-stage relaxation, IPR analysis, charge/spin branching, post-processing: PDOS, hyperfine, ZFS)
Excited-state workflows (ΔSCF, relaxation, post-processing: dipole, lifetimes)

Automation is orchestrated using httk task managers, with errors handled via checkpointing and parameter adjustments. Key scientific results (ZPL, hyperfine tensors, formation energies) are delivered reproducibly for complex supercells (e.g., V_Si⁻ in 4H-SiC), and further DFT quantities such as transition levels are extracted systematically.
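Checkpointing combined with parameter adjustment can be illustrated with a small, self-contained sketch. It does not use the httk or VASP interfaces: the stage functions, the `mixing` parameter, and the halving rule are hypothetical stand-ins for a real convergence handler.

```python
import json
import os
import tempfile


class ConvergenceError(Exception):
    """Raised by a stage that fails to converge (hypothetical)."""


def run_with_recovery(stages, checkpoint_path, params, max_retries=2):
    """Run stages in order, skipping those recorded in the checkpoint."""
    completed = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            completed = json.load(fh)
    for name, stage in stages:
        if name in completed:
            continue  # finished in an earlier run
        for attempt in range(max_retries + 1):
            try:
                stage(params)
                break
            except ConvergenceError:
                if attempt == max_retries:
                    raise
                params["mixing"] *= 0.5  # assumed adjustment rule
        completed.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(completed, fh)
    return completed


calls = []


def relax(params):
    calls.append("relax")


def ground_state(params):
    calls.append("ground_state")
    if params["mixing"] > 0.25:
        raise ConvergenceError("SCF did not converge")


stages = [("relax", relax), ("ground_state", ground_state)]
params = {"mixing": 0.4}

with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    first_run = run_with_recovery(stages, checkpoint, params)
    second_run = run_with_recovery(stages, checkpoint, params)  # resumes
```

In this toy run the ground-state stage fails once, succeeds after the parameter is halved, and the second invocation re-executes nothing because both stages are already recorded in the checkpoint.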

5. Fault Tolerance, Adaptivity, and Robustness in AutoDCWorkflow

A critical property of advanced AutoDCWorkflow systems is robust error detection and recovery. In LLM-based document automation (Zhang et al., 4 Dec 2025), stepwise, rollback-enabled orchestration adapts to evolving state and detects misalignment via validator LLMs that compute change-deltas between successive states. If the validator judges a decision to have failed with high confidence, an argument-level or API-level rollback is invoked. Rollbacks are triggered by an empirically chosen confidence threshold, and persistent failure escalates recovery from argument regeneration to alternative API selection. These mechanisms provide local error correction and session-level reliability, with reported instruction-level and session-level improvements over baselines of +40% and +76%, respectively.
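The escalation policy described above can be sketched as a small decision function. The threshold value, the retry limit, and the action names are hypothetical choices made for illustration; the source does not specify them here.

```python
def choose_recovery(failure_confidence, failed_attempts,
                    threshold=0.8, max_argument_retries=2):
    """Pick a recovery action for the current step.

    failure_confidence: validator's confidence that the step failed.
    failed_attempts: failures on this step so far, including this one.
    threshold and max_argument_retries are assumed values.
    """
    if failure_confidence < threshold:
        return "accept"  # no rollback triggered
    if failed_attempts <= max_argument_retries:
        return "regenerate_arguments"  # argument-level rollback
    return "select_alternative_api"  # API-level rollback


low_confidence = choose_recovery(0.5, failed_attempts=1)
first_failure = choose_recovery(0.9, failed_attempts=1)
persistent_failure = choose_recovery(0.9, failed_attempts=3)
```

Keeping the policy as a pure function of the validator's confidence and the failure count makes the escalation path easy to test independently of any LLM call.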

In digital forensics, error recovery encompasses job status tracking, lock-based concurrency control, and log-based diagnostics for failures, while in ADAQ workflows, automated checkpointing, convergence checking, and handler modules (for SCF stalls and parameter divergence) maintain robustness without human intervention.

6. Limitations and Future Directions

Across domains, AutoDCWorkflow implementations face limitations:

Restricted operation pools (LLM-driven systems lack joins, imputations, or joint column repair)
Potential LLM hallucinations or ambiguity in operation generation (arguments, regex)
Independence assumptions that preclude modeling of inter-column dependencies (cleaning)
Generalization challenges when scaling to large datasets or high cardinality
Requirement for tool wrappers and environment-specific configuration in forensic settings
Security, license compliance, and resource scaling in production deployments

Future work involves expanding operation libraries, integrating prior workflow retrieval, extending error-handling protocols, joint modeling of dependencies, and scaling automation to broader, more heterogeneous datasets and scientific problems.

7. Significance and Outlook

AutoDCWorkflow methodologies establish a paradigm for scalable, reliable, and reproducible automation in document-centered, data-centric, and scientific domains. They leverage modular planning, explicit state and provenance tracking, incremental error recovery, and tailored benchmarking to achieve performance and accuracy gains while reducing manual burden. These advances set a foundation for the next generation of adaptive, high-assurance automation agents in research, digital forensics, and computational science (Li et al., 2024, Braekt et al., 2017, Davidsson et al., 2020, Zhang et al., 4 Dec 2025, Liu et al., 2021).