MDG: MLaaS Dataset Generator

Updated 25 January 2026

MDG is a configurable framework that simulates MLaaS operations, enabling systematic benchmarking, workflow optimization, and integration with IoT systems.
It employs six tightly integrated stages to generate reproducible datasets across diverse ML models and tasks, using both IID and non-IID data splits.
MDG records detailed performance metrics and QoS indicators; in the reported experiments, selection driven by MDG data is up to 62% more accurate than QWS-based baselines, and workflow composition quality also improves.

A Machine Learning as a Service Dataset Generator (MDG) is a configurable framework for generating rich, reproducible datasets that systematically capture the behavior, performance, and composability of MLaaS service instances across real-world conditions. MDG is designed to simulate realistic MLaaS operations—spanning training, evaluation, and service composition—enabling rigorous benchmarking and downstream research on service selection, workflow optimization, and IoT system composition (Kanneganti et al., 18 Jan 2026).

1. Architectural Overview

MDG comprises six tightly integrated stages, structured to maximize reproducibility and coverage of MLaaS service diversity:

Input-Data Generation: Supports interactive (Wizard), controlled (Generate), and randomized (Autogen) entry points for dataset/model/hyperparameter selection, facilitating both guided and large-scale, automated simulation.
Dataset & Model Configuration: Implements normalization and partitioning; includes IID and non-IID data splits (Dirichlet $\alpha$ , shard, quantity-skew), reflecting federated and skewed deployment scenarios encountered in IoT contexts.
Individual MLaaS Simulation: Trains an array of models (CNN, RNN, MLP, MobileNetV2, Random Forest, Logistic Regression, K-means) using federated and centralized protocols, logging metrics at run, round, and client granularity with systematic SQLite persistence.
Composability Indicator Computation: Quantifies functional and cross-service compatibility via metrics such as Data Utility Measurement (DUM), Model Utility Measurement (MUM), Scalability Measurement (SM), Historical Quality Score (HQS), and Service Reliability Score (SRS).
Service Composition Executor: Executes parametric aggregations (e.g., weighted parameter averaging for neural models) and ensemble-based approaches for non-parametric models, maintaining fidelity to real-world workflow aggregation patterns.
Dataset Export & Storage: Outputs comprehensive instance and composition datasets in CSV, JSON, and SQLite formats, facilitating broad integration with evaluation pipelines.
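The Dirichlet-based non-IID split named in the Dataset & Model Configuration stage can be sketched as follows. This is a minimal illustration rather than MDG's own code: the function name `dirichlet_partition`, its parameters, and the per-class splitting scheme are assumptions. It uses only the standard library and draws each Dirichlet vector by normalizing Gamma variates. Smaller values of `alpha` concentrate each class on fewer clients, while larger values approach an even split.

```python
import random
from collections import defaultdict

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with label skew controlled by alpha.

    Hypothetical helper: for every class, a Dirichlet(alpha) vector decides
    which fraction of that class each client receives.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet(alpha) draw equals normalized Gamma(alpha, 1) variates.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0  # guard against numerical underflow
        shares = [g / total for g in gammas]

        # Turn the shares into contiguous slices of this class's indices.
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = min(len(indices), max(start, round(cumulative * len(indices))))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients

if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]
    for client_id, part in enumerate(dirichlet_partition(toy_labels, 5, alpha=0.1)):
        print(client_id, len(part))
```

Every index lands on exactly one client, and fixing `seed` makes the split reproducible, which matches the reproducibility goal stated for the pipeline.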

2. Simulation of Diverse Model Families

MDG supports training and evaluation of six major model classes across multiple canonical datasets (MNIST, Fashion-MNIST, Digits, CIFAR-10, Iris, Wine, California Housing):

All models are instantiated with exhaustive, grid-sampled hyperparameters: learning rate $\eta \in [10^{-4}, 10^{-1}]$ , batch size $B \in [16, 256]$ , epochs $E \in [1, 10]$ , and $T \in [5, 50]$ rounds.
Data preprocessing covers feature scaling (none, standard, min–max) and automated train/test splitting (controlled by $\mathtt{test\_size}$ ); classification, regression, and clustering tasks are all supported.
Each federated simulation logs per-round metrics and final predictions, maintaining exhaustive traceability of each simulated MLaaS instance.
Evaluation metrics:
- Classification: Accuracy, Precision, Recall, $F_1$ -score (stored by run, round, client).
- Regression: RMSE.
- Clustering: silhouette score, inertia, ARI, NMI.
All service metrics are persistently tracked for detailed post hoc analysis.
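One way to persist metrics at run, round, and client granularity in SQLite is sketched below. The table name `round_metrics`, its columns, and the helper functions are illustrative assumptions; the source does not describe MDG's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per (run, round, client, metric).
SCHEMA = """
CREATE TABLE IF NOT EXISTS round_metrics (
    run_id    TEXT    NOT NULL,
    round_no  INTEGER NOT NULL,
    client_id INTEGER NOT NULL,
    metric    TEXT    NOT NULL,
    value     REAL    NOT NULL,
    PRIMARY KEY (run_id, round_no, client_id, metric)
)
"""

def open_store(path=":memory:"):
    """Open (or create) the metrics database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def log_metric(conn, run_id, round_no, client_id, metric, value):
    """Insert one measurement, overwriting a previous value for the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO round_metrics VALUES (?, ?, ?, ?, ?)",
        (run_id, round_no, client_id, metric, float(value)),
    )

def round_means(conn, run_id, metric):
    """Average a metric over clients for each round of one run."""
    rows = conn.execute(
        "SELECT round_no, AVG(value) FROM round_metrics "
        "WHERE run_id = ? AND metric = ? GROUP BY round_no ORDER BY round_no",
        (run_id, metric),
    )
    return dict(rows.fetchall())

if __name__ == "__main__":
    store = open_store()
    log_metric(store, "run-1", 1, 0, "accuracy", 0.80)
    log_metric(store, "run-1", 1, 1, "accuracy", 0.90)
    print(round_means(store, "run-1", "accuracy"))
```

Keeping one narrow row per measurement means a new metric (for example silhouette score for clustering runs) needs no schema change, and per-round or per-run summaries reduce to `GROUP BY` queries.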

3. Functional and Quality-of-Service Attributes

Every MLaaS service instance generated by MDG is annotated with comprehensive functional descriptors and QoS records, including:

Supported algorithms and task types (classification/regression/clustering).
Input/output formats and schema metadata (JSON feature vectors, tensor shapes, label types).
Hyperparameter sets (learning rate, batch size, etc.).
Data distribution strategies (IID/non-IID, Dirichlet $\alpha$ , shards, quantity-skew).
Endpoint details (API schema, authentication).
Measured QoS attributes under realistic IoT network perturbations:
- Response time $r_i$ , throughput $\tau$ , reliability $\rho$ , availability $A$ —aggregated by round/client and summarized per instance.
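The source names the four QoS attributes but does not give their formulas. The sketch below therefore uses common textbook definitions as an assumption: mean response time over requests, throughput as successful requests per second, reliability as the fraction of successful requests, and availability as uptime divided by the observation window. `RequestLog` and `summarize_qos` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestLog:
    response_ms: float  # response time r_i of one request
    succeeded: bool     # whether the service returned a valid result

def summarize_qos(requests, window_s, uptime_s):
    """Summarize one service instance over an observation window.

    All four definitions are assumptions, not MDG's documented formulas.
    """
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    successes = sum(1 for r in requests if r.succeeded)
    return {
        "mean_response_ms": mean(r.response_ms for r in requests),
        "throughput_rps": successes / window_s,     # tau
        "reliability": successes / len(requests),   # rho
        "availability": uptime_s / window_s,        # A
    }

if __name__ == "__main__":
    logs = [RequestLog(100, True), RequestLog(200, True),
            RequestLog(300, False), RequestLog(400, True)]
    print(summarize_qos(logs, window_s=2.0, uptime_s=1.8))
```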

4. Composition-Specific Indicators and Optimization

MDG provides systematic computation of cross-service compatibility and optimal workflow selection:

Composability indicators:
- DUM: $1 - D_{\mathrm{KL}}(P_i \| P_j)$ quantifies distribution compatibility.
- MUM: $\alpha\,\mathrm{Accuracy}(s) + (1-\alpha)\,\mathrm{F}_1(s)$ balances performance attributes.
- SM: scalability ratio.
- HQS: moving average of workflow scores.
- SRS: long-term reliability.
Composition optimization:
- Objective: maximize total MUM subject to latency and cost constraints.
  
  $\max_{S\subseteq\mathcal{S}} \sum_{s\in S}\mathrm{MUM}(s) \quad \text{s.t.} \quad \sum_{s\in S} r(s) \le R_{\max},\; \sum_{s\in S} c(s) \le C_{\max}$
- Both parametric aggregations (for neural models) and non-parametric ensemble strategies are supported, reflecting the diversity of real-world MLaaS workflows.
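The DUM and MUM formulas above translate directly into code. The sketch below assumes that $P_i$ and $P_j$ are discrete distributions over a shared support (for example, label histograms) and that the KL divergence is measured in nats; neither detail is stated in the source. Because KL divergence is asymmetric and unbounded, DUM as written is order-dependent and can fall below zero for strongly mismatched services.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i = 0 contribute nothing; eps avoids division by zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def dum(p, q):
    """Data Utility Measurement: 1 - D_KL(p || q)."""
    return 1.0 - kl_divergence(p, q)

def mum(accuracy, f1, alpha=0.5):
    """Model Utility Measurement: alpha * accuracy + (1 - alpha) * F1."""
    return alpha * accuracy + (1 - alpha) * f1

if __name__ == "__main__":
    print(dum([0.5, 0.5], [0.5, 0.5]))   # identical distributions
    print(dum([0.9, 0.1], [0.1, 0.9]))   # strongly mismatched distributions
    print(mum(accuracy=0.9, f1=0.8, alpha=0.5))
```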

5. Benchmark Dataset Composition and Statistical Properties

The current MDG release encompasses:

Attribute	Value/Range	Notes
Number of instances	10,432	Exhaustive across models/datasets
Datasets	MNIST, CIFAR-10, Iris, etc.	Seven standard tasks
Models	CNN, RNN, MLP, RF, etc.	Six major families
Task breakdown	~4k classification, etc.	Classification, regression, clustering
Data splits	50% IID, 50% non-IID	Dirichlet $\alpha$ in [0.1, 1]
Hyperparam ranges	$\eta$ , $B$ , $E$ , $T$	As above
Accuracy (classif.)	$\mathcal{N}(0.82, 0.05)$	Aggregated over runs
Response time (ms)	LogNormal $(5.3, 0.2)$	IoT emulated conditions
Reliability	$\beta$ -distributed $\sim$ 0.90	Across service instances

MDG thus provides fine-grained records suitable for benchmarking selection and composition strategies under diverse operational scenarios.
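To read the response-time row in concrete units, the short calculation below assumes that LogNormal $(5.3, 0.2)$ gives the mean and standard deviation of the natural logarithm of the response time in milliseconds, and that the second parameter of $\mathcal{N}(0.82, 0.05)$ is a standard deviation. Both readings are assumptions; the source does not state the parameterization. Under them, the typical response time is roughly 200 ms.

```python
import math
import random

MU, SIGMA = 5.3, 0.2  # assumed parameters of ln(response time in ms)

# Closed-form summaries of a log-normal distribution.
median_ms = math.exp(MU)
mean_ms = math.exp(MU + SIGMA ** 2 / 2)

def sample_instance(rng):
    """Draw one synthetic (accuracy, response time) pair for illustration only."""
    accuracy = min(1.0, max(0.0, rng.gauss(0.82, 0.05)))  # clipped to [0, 1]
    response_ms = rng.lognormvariate(MU, SIGMA)
    return accuracy, response_ms

if __name__ == "__main__":
    print(f"median {median_ms:.1f} ms, mean {mean_ms:.1f} ms")
    print(sample_instance(random.Random(0)))
```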

6. Integrated Service Composition Mechanism

MDG incorporates a native mechanism for automated workflow selection:

Input: Service registry S, constraints (R_max, C_max, accuracy_min)
Output: Best workflow W*
1. candidates ← filter S by constraints
2. indicators ← compute {DUM, MUM, SM, HQS, SRS} for all candidates
3. best_score ← -∞; W* ← ∅
4. for each subset W ⊆ candidates of size K:
    θ_W ← ∑_{s∈W} w_s · θ_s      # parametric aggregation
    ŷ_W ← majority_vote({ŷ_s | s∈W}) # ensemble for non-parametric
    acc_W ← evaluate(...)
    latency_W ← ∑_{s∈W} r(s)
    score_W ← α·acc_W - β·latency_W
    if score_W > best_score:
        best_score ← score_W; W* ← W
5. return W*

Optimization prioritizes accuracy under latency and cost constraints, as is typical in high-stakes IoT and MLaaS scenarios. This logic enables direct evaluation and improvement of algorithmic selection and composition strategies using the MDG-generated datasets.
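The selection loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not MDG's implementation: the `Service` record, the shared validation labels `truth`, and the default weights are hypothetical, only the majority-vote (non-parametric) branch is shown, and the latency and cost limits are applied to each candidate workflow's totals, following the formulation in Section 4. Summing latencies assumes the services run sequentially.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Service:
    name: str
    predictions: tuple  # labels predicted on a shared validation set
    latency_ms: float   # r(s)
    cost: float         # c(s)

def majority_vote(services):
    """Per-sample majority label; ties go to the label seen first."""
    columns = zip(*(s.predictions for s in services))
    return [Counter(col).most_common(1)[0][0] for col in columns]

def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def select_workflow(registry, truth, k, r_max, c_max,
                    acc_min=0.0, alpha=1.0, beta=0.001):
    """Exhaustively score every size-k subset and return the best feasible one."""
    best_score, best = float("-inf"), None
    for subset in combinations(registry, k):
        latency = sum(s.latency_ms for s in subset)
        cost = sum(s.cost for s in subset)
        if latency > r_max or cost > c_max:
            continue  # violates the latency or cost budget
        acc = accuracy(majority_vote(subset), truth)
        if acc < acc_min:
            continue  # below the minimum accuracy requirement
        score = alpha * acc - beta * latency
        if score > best_score:
            best_score, best = score, subset
    return best, best_score

if __name__ == "__main__":
    truth = [0, 1, 1, 0]
    registry = [
        Service("a", (0, 1, 1, 0), 50, 1.0),
        Service("b", (0, 1, 0, 0), 20, 1.0),
        Service("c", (1, 1, 1, 0), 30, 1.0),
        Service("d", (0, 0, 0, 0), 10, 1.0),
    ]
    workflow, score = select_workflow(registry, truth, k=3, r_max=100, c_max=3)
    print([s.name for s in workflow], round(score, 3))
```

Exhaustive enumeration evaluates $\binom{n}{K}$ subsets, so it is practical only for small registries or small $K$; larger registries would need a heuristic or a solver.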

7. Experimental Results and Practical Impact

In controlled comparisons against traditional QWS-based baselines, MDG-driven selection achieves 12–62% higher selection accuracy in relative terms, and MDG-driven composition raises the composition score by 0.10 in absolute terms (about 17% relative).
The benchmark datasets and composition mechanisms support robust research on MLaaS service matching, workflow structuring, and cross-service reliability in realistic IoT settings.
Empirical results (rule-based: 0.92 vs. 0.82, skyline-based: 0.81 vs. 0.50, composition score: 0.68 vs. 0.58) substantiate the utility of MDG for systematic and reproducible MLaaS research (Kanneganti et al., 18 Jan 2026).

MDG establishes a formal, extensible foundation for data-driven advancements in MLaaS benchmarking, selection, and service workflow composition, especially within heterogeneous and resource-variable environments typified by IoT deployments.

References (1)

Machine Learning as a Service (MLaaS) Dataset Generator Framework for IoT Environments (2026)