iDataScience: Automated Data Pipeline
- iDataScience is a fully automated end-to-end data science process integrating statistical and machine learning modules to convert raw data into actionable insights.
- It employs a modular pipeline for data ingestion, profiling, feature suggestion, model recommendation, and insight generation, ensuring reproducibility.
- By automating workflow stages, iDataScience eliminates manual data wrangling, accelerates analytics, and democratizes advanced computational techniques.
iDataScience refers to the full automation and augmentation of the end-to-end data science process, enabling the conversion of raw data into actionable insights through a chained sequence of statistical and machine learning modules, with minimal human intervention. Realized in practice by the Augmented Data Science (ADS) framework, iDataScience encapsulates the “industrialization and democratization” of analytics by replacing manual, ad hoc workflows with modular, domain-agnostic, and largely automated analytical building blocks (Uzunalioglu et al., 2019).
1. System Architecture and Pipeline
iDataScience, as instantiated in ADS, consists of five major, sequential modules that collectively perform the essential stages of data science:
- Data Ingestion
- Data Profiling
- Feature Suggestion
- Model Recommendation
- Insight Generation
Each module automates a traditionally labor-intensive component of the pipeline. Data flows from left to right through the pipeline, with each module optionally parameterized by a small set of tunable thresholds (e.g., outlier α, pattern-tolerance p, top-M features), ensuring both efficiency and consistency. The modules are partially overlapping but are designed for tight integration and direct handoff of data and metadata between stages.
| Stage | Core Functions | Typical Input/Output |
|---|---|---|
| Data Ingestion | Schema-on-read, sampling, connection to various sources | Raw tabular data → Sampled data pointer |
| Data Profiling | Type inference, structural discovery, statistics, patterns | Sampled data → Data/Relation graphs |
| Feature Suggestion | Automated engineering, aggregation, relation graph | Relation graph → Feature set |
| Model Recommendation | Feature selection, algorithm/hyperparam. search, CV eval | Feature set → Ranked pipelines |
| Insight Generation | Human-readable reports, visualizations, pattern surfacing | Pipelines, data → Summary output |
The pipeline’s design enables rapid, domain-agnostic transformation of diverse tabular data into reproducible, interpretable outputs.
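The sequential handoff of data and metadata between modules can be pictured with a minimal Python sketch. The Context container, the run_pipeline helper, and the two placeholder stages are hypothetical names chosen for illustration and are not part of ADS; real modules would do far more than record a few facts.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Context:
    """Hypothetical container for the data and metadata passed between stages."""
    data: Any
    metadata: Dict[str, Any] = field(default_factory=dict)


Stage = Callable[[Context], Context]


def run_pipeline(ctx: Context, stages: List[Stage]) -> Context:
    """Apply each stage in order and record which stages ran."""
    for stage in stages:
        ctx = stage(ctx)
        ctx.metadata.setdefault("trace", []).append(stage.__name__)
    return ctx


def ingest(ctx: Context) -> Context:
    # Placeholder for Data Ingestion: only records the row count.
    ctx.metadata["n_rows"] = len(ctx.data)
    return ctx


def profile(ctx: Context) -> Context:
    # Placeholder for Data Profiling: only records the column names.
    ctx.metadata["columns"] = sorted(ctx.data[0].keys()) if ctx.data else []
    return ctx


rows = [{"age": 31, "price": 9.5}, {"age": 45, "price": 12.0}]
result = run_pipeline(Context(rows), [ingest, profile])
print(result.metadata)
```

Keeping every stage behind the same signature is one way to let later stages read the metadata written by earlier ones without knowing how it was produced.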
2. Algorithmic and Statistical Foundations
The iDataScience workflow relies on established statistical foundations for automation and reproducibility:
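- Summary Statistics: For a column $x = (x_1, \dots, x_n)$, the mean and variance are computed as $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$.
- Distribution Fitting: Each feature is modeled against parametric families (e.g., Gaussian, Poisson) via maximum likelihood, $\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i; \theta)$.
- Outlier Detection: Outliers are flagged using $z$-scores, with $z_i = (x_i - \mu)/\sigma$, marking values with $|z_i| > \alpha$, where $\alpha$ is a tunable threshold with a preset default.
- Relationship Mining: Pearson correlation $\rho$, mutual information $I(X;Y)$, and the Goodman–Kruskal statistic for thematic dependencies.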
- String Pattern Mining: Maximal common string patterns are mined by a constrained maximization over candidate patterns, controlled by the pattern-tolerance threshold $p$.
- Relational Path Enumeration: Paths through a relation graph $G = (V, E)$ are computed from a user-specified anchor, supporting one-to-one and one-to-many aggregations.
These algorithmic steps are selected for their domain-agnostic applicability and tractability under automated control.
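Several of the statistics listed above are simple enough to sketch in plain Python. The function names below are illustrative, the population (1/n) variance is an assumption, and the default alpha=3.0 is a common convention used here only as a placeholder, since the framework's actual default is not stated.

```python
import math
from collections import Counter


def mean_var(xs):
    """Mean and population variance (1/n normalization) of a numeric column."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


def zscore_outliers(xs, alpha=3.0):
    """Indices whose |z-score| exceeds alpha (alpha=3.0 is an assumed default)."""
    mu, var = mean_var(xs)
    sigma = math.sqrt(var)
    if sigma == 0:
        return []  # constant column: no point deviates from the mean
    return [i for i, x in enumerate(xs) if abs((x - mu) / sigma) > alpha]


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric columns."""
    mx, vx = mean_var(xs)
    my, vy = mean_var(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(vx * vy)


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


values = [10] * 20 + [100]
print(zscore_outliers(values))
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(mutual_information(["a", "a", "b", "b"], ["a", "a", "b", "b"]))
```

Because each function depends only on column values, the same code applies to any tabular source, which is the sense in which these descriptors are domain-agnostic.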
3. Detailed Module Descriptions
Data Ingestion uses schema-on-read to connect to diverse tabular data sources, loading representative samples for efficient downstream processing while maintaining pointers to full datasets.
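One way to load a representative sample while keeping a reference to the full dataset is single-pass reservoir sampling. The source does not say which sampling scheme ADS uses, so the technique, the ingest_sample function, and the orders.csv source name below are all assumptions made for illustration.

```python
import csv
import io
import random


def ingest_sample(stream, source, k, seed=0):
    """Reservoir-sample k rows from a CSV stream in one pass.

    Returns the sample together with a pointer (here just the source name)
    and the total row count, so later stages can revisit the full data.
    """
    rng = random.Random(seed)
    reader = csv.DictReader(stream)  # schema-on-read: columns come from the header
    reservoir = []
    total = 0
    for total, row in enumerate(reader, start=1):
        if len(reservoir) < k:
            reservoir.append(row)
        else:
            # Keep the new row with probability k / total.
            j = rng.randrange(total)
            if j < k:
                reservoir[j] = row
    return {"source": source, "n_rows": total, "sample": reservoir}


text = "id,price\n" + "\n".join(f"{i},{i * 1.5}" for i in range(100))
handle = ingest_sample(io.StringIO(text), source="orders.csv", k=10)
print(handle["n_rows"], len(handle["sample"]))
```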
Data Profiling automatically infers primitive types (integer, float, string, datetime) and applies contextual labeling (e.g., “email”, “zip code”), even in the absence of metadata. High-coverage summary statistics, distributional modeling, outlier detection, inter-column dependency graphs (Pearson, Spearman, mutual information, Goodman–Kruskal), and string pattern mining jointly enable comprehensive data characterization. Systematic missingness is identified using co-missingness clustering and predictive rule learning.
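Two of these profiling steps can be approximated in a few lines. The sketch below infers a primitive type by trying progressively looser parsers, and groups columns that are missing in exactly the same rows. Both functions are hypothetical simplifications: the exact-match grouping stands in for the co-missingness clustering described above, and contextual labels such as "email" are not attempted.

```python
from datetime import datetime


def infer_type(values):
    """Return the narrowest primitive type that parses every non-empty string."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "unknown"

    def all_parse(parser):
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"


def co_missing_groups(rows, columns):
    """Group columns whose missing-value masks are identical across rows."""
    masks = {}
    for col in columns:
        mask = tuple(row.get(col) in ("", None) for row in rows)
        if any(mask):  # ignore columns with no missing values
            masks.setdefault(mask, []).append(col)
    return [cols for cols in masks.values() if len(cols) > 1]


rows = [
    {"own_car": "N", "car_age": "", "car_type": "", "age": "30"},
    {"own_car": "Y", "car_age": "5", "car_type": "suv", "age": ""},
    {"own_car": "N", "car_age": "", "car_type": "", "age": "41"},
]
columns = ["own_car", "car_age", "car_type", "age"]
print({c: infer_type([r[c] for r in rows]) for c in columns})
print(co_missing_groups(rows, columns))
```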
Feature Suggestion automates feature engineering through both column-wise and row-wise transformations. From the schema and detected functional dependencies, a relation graph is built. For a chosen anchor (e.g., customerID), all relational paths are enumerated, with aggregation functions (min, max, mean, count, std, sum for numerics; count and distinct-count for categoricals) applied accordingly. For temporal data, dozens of statistical, spectral, and autocorrelation features are generated as per TSFresh conventions.
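The path-and-aggregate idea can be illustrated on a two-hop path from orders through products to price, rolled up per customer. The join_path and aggregate helpers and the sample rows are hypothetical; the population standard deviation is used so that customers with a single order still get a defined value.

```python
from collections import defaultdict
from statistics import mean, pstdev


def join_path(child_rows, parent_lookup, key, value):
    """Follow one hop: copy parent_lookup[row[key]] onto each child row as `value`."""
    return [{**row, value: parent_lookup[row[key]]} for row in child_rows]


def aggregate(rows, anchor, value):
    """Aggregate a numeric field per anchor key with the listed functions."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[anchor]].append(row[value])
    return {
        key: {
            f"min_{value}": min(vals),
            f"max_{value}": max(vals),
            f"mean_{value}": mean(vals),
            f"sum_{value}": sum(vals),
            f"std_{value}": pstdev(vals),  # population std, defined for one value
            "count": len(vals),
        }
        for key, vals in groups.items()
    }


orders = [
    {"orderID": 1, "customerID": "c1", "productID": "p1"},
    {"orderID": 2, "customerID": "c1", "productID": "p2"},
    {"orderID": 3, "customerID": "c2", "productID": "p1"},
]
price_by_product = {"p1": 10.0, "p2": 5.0}

priced_orders = join_path(orders, price_by_product, "productID", "price")
features = aggregate(priced_orders, "customerID", "price")
print(features["c1"])
```

Longer paths would repeat the join step once per hop before the final aggregation at the anchor.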
Model Recommendation manages full model selection: imputing/flagging missing values, stratified train/test splits, two-stage feature selection (pairwise ranking by a univariate relevance score or mutual information, followed by greedy forward selection under cross-validation), randomized or Bayesian hyperparameter search, k-fold CV model training, and evaluation using AUC or RMSE depending on task. The module outputs a ranked list of the top-$k$ pipelines, each specifying algorithm, hyperparameters, and pre/post-processing configuration.
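The second selection stage, greedy forward selection under cross-validation, can be sketched as follows. To stay self-contained the example scores feature subsets with the k-fold accuracy of a nearest-centroid classifier; that model, the accuracy metric (instead of AUC or RMSE), the min_gain stopping rule, and all function names are assumptions for illustration only.

```python
def cv_accuracy(X, y, features, k=3):
    """k-fold CV accuracy of a nearest-centroid classifier on the given features."""
    n = len(X)
    correct = 0
    for fold in (list(range(i, n, k)) for i in range(k)):
        held_out = set(fold)
        sums, counts = {}, {}
        for i in range(n):
            if i in held_out:
                continue
            vec = sums.setdefault(y[i], [0.0] * len(features))
            for j, f in enumerate(features):
                vec[j] += X[i][f]
            counts[y[i]] = counts.get(y[i], 0) + 1
        centroids = {c: [s / counts[c] for s in vec] for c, vec in sums.items()}
        for i in fold:
            pred = min(
                centroids,
                key=lambda c: sum(
                    (X[i][f] - centroids[c][j]) ** 2 for j, f in enumerate(features)
                ),
            )
            correct += pred == y[i]
    return correct / n


def greedy_forward_selection(score, candidates, max_features, min_gain=1e-9):
    """Add the best-scoring feature each round; stop when the score stops improving."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        trials = [(score(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best + min_gain:
            break
        selected.append(top_feature)
        best = top_score
    return selected, best


# Feature 0 separates the classes; features 1 and 2 carry no signal.
X = [[0.0, 5, 1], [0.1, 1, 0], [0.2, 4, 1], [1.0, 5, 0], [1.1, 1, 1], [0.9, 4, 0]]
y = [0, 0, 0, 1, 1, 1]
selected, best = greedy_forward_selection(
    lambda feats: cv_accuracy(X, y, feats), candidates=[0, 1, 2], max_features=3
)
print(selected, best)
```

Passing the scorer as a callable keeps the search loop independent of the model and metric, so a different estimator or an AUC-based score could be substituted without changing the selection logic.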
Insight Generation produces human-readable and visual summaries. Example outputs include:
- “Column X is numeric (integer) with mean $\mu$, standard deviation $\sigma$, and 7% outliers beyond $|z| > \alpha$.”
- “Columns A and B have a correlation $|\rho|$ close to 1, suggesting redundant information.”
- “String field ‘email’ matches pattern @.* for 99.8% of rows; extracted user-name and domain subfields.”
- “When FLAG_OWN_CAR=N, 45 other fields are systematically missing; created ‘missing_flag_CAR’ feature.”
Additionally, relation/association graphs, missingness clusters, feature importances, and model performance curves are visualized.
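Statements of this kind can be produced by filling text templates from profiling results. The sketch below is a hypothetical illustration: the function names, the wording of the templates, and the 0.95 redundancy threshold are assumptions rather than ADS behavior.

```python
def describe_column(name, dtype, mu, sigma, outlier_frac, alpha):
    """Render a one-sentence numeric column summary from profiling statistics."""
    return (
        f"Column {name} is numeric ({dtype}) with mean {mu:.2f}, "
        f"std {sigma:.2f}, {outlier_frac:.0%} outliers beyond |z| > {alpha:g}."
    )


def describe_redundancy(col_a, col_b, rho, threshold=0.95):
    """Report a column pair only when |rho| reaches the (assumed) threshold."""
    if abs(rho) < threshold:
        return None
    return (
        f"Columns {col_a} and {col_b} have correlation {rho:.2f}, "
        "suggesting redundant information."
    )


print(describe_column("X", "integer", 10.0, 2.0, 0.07, 3.0))
print(describe_redundancy("A", "B", 0.98))
```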
4. Example Workflows and Case Studies
Toy End-to-End Example
Given a schema with Customer, Order, and Product tables (with columns like customerID, email, age, orderID, productID, price):
- Data Profiling: Infers email type, extracts username/domain, computes summary statistics (mean, variance) for age, analyzes missingness.
- Feature Suggestion: Paths such as customerID→order→product→price are traversed, producing for each customer aggregates like sum_price, mean_price, count_orders.
- Model Recommendation: For a binary churn target, features such as email_domain, sum_price, mean_price, count_orders are selected. 5-fold CV is run with logistic regression, random forest, and XGBoost; the top pipeline is a random forest with its selected hyperparameters.
- Insight Generation: Reports phrases such as “Customers with high count_orders have 0.92 probability of churn.” and “email_domain=‘gmail.com’ appears in 60% of churn cases.”
Large-scale Case Study: Home Credit Default Risk
- Dataset: 7 interconnected tables, ~308K training, ~48K test rows, 50M+ transactions.
- Workflow: Data Profiling discovered 55 highly correlated columns; missingness clustering generated new “missing_flag” features. Feature Suggestion (anchor=SK_ID_CURR) generated ~1,385 features, including aggregates and temporal statistics. Model Recommendation evaluated random forest, LightGBM, and logistic regression; LightGBM achieved the best score.
- Impact: Achieved a score close to the top of the Kaggle leaderboard with far fewer automated iterations than the hundreds of manual iterations required by experts. Estimated 80% reduction in time for data understanding and feature engineering.
5. Industrialization and Democratization Impact
The underlying motivation for iDataScience is the removal of the “human bottleneck” in data science workflows. Each module replaces manual and domain-specific steps with scalable automation. Data Profiling substitutes for manual schema inspection and exploratory analysis; Feature Suggestion eliminates hand-coded feature engineering; Model Recommendation automates algorithm selection and tuning; and Insight Generation surfaces statistical patterns for interpretation by both experts and domain novices.
All processes in the pipeline are designed to be domain-agnostic, relying solely on statistical descriptors (mean, variance, correlation, mutual information) and simple schema operations, ensuring portability and accessibility. The interactive UI allows users to adjust thresholds or drill down on particular patterns, with sensible defaults (e.g., pattern-tolerance $p$, outlier threshold $\alpha$, top_M=100 features) established for cross-domain functionality. Outputs are standardized as tables, scoring files, and PDF/HTML reports.
By tightly integrating these modules over robust statistical foundations and enabling configurable augmentation, iDataScience—via ADS—industrializes and democratizes end-to-end analytics for both data scientists and domain experts (Uzunalioglu et al., 2019).