Data Fusion Labeler (dFL) Workflow

Updated 14 November 2025

Data Fusion Labeler (dFL) is a comprehensive framework that fuses heterogeneous data using uncertainty-aware harmonization and provenance-rich labeling.
It integrates data ingestion, modular operator-order processing, and both signal- and feature-level fusion to support traditional and machine learning pipelines.
The system enhances throughput and accuracy in complex domains by ensuring real-time, reproducible, and schema-compliant data processing.

The Data Fusion Labeler (dFL) refers to a set of methodologies and toolchains for fusing, harmonizing, and labeling heterogeneous data, with particular impact in high-complexity domains such as fusion energy research. Across its variants, dFL tightly integrates uncertainty-aware harmonization, schema-compliant fusion, and provenance-rich labeling. It acts as a unified workflow instrument, supporting both traditional and machine learning pipelines by embedding operator-order-awareness, normalization, alignment, and complete reproducibility, resulting in substantial improvements in analysis throughput, label quality, and cross-device/model generality (Michoski et al., 12 Nov 2025).

1. System Architecture and Operator-Order Provenance

The dFL pipeline is modular, typically organized in four core stages:

Data Ingestion utilizes a user-supplied data_provider (often leveraging TokSearch or IMAS/OMAS accessors).
Data Harmonization executes a user-controlled pipeline of alignment/trimming (T), gap-fill (F), resampling (R), smoothing (S), and normalization (N). Because some of these operators do not commute (e.g., $N \circ S \ne S \circ N$), execution is operator-order-aware and the order is recorded so that results are reproducible. The default order is $O_0 = T \rightarrow F \rightarrow R \rightarrow S \rightarrow N \rightarrow E$.
Data Fusion/Feature Extraction applies signal-level or feature-level methods, such as inverse-variance/stochastic stacking or complex feature computation via programmable Python modules.
Labeling & Export provides GUI-driven manual and scalable automated labeling (“autolabelers”), exporting annotations and signals with all provenance metadata to institutional or open repositories.

All functions are first-class, provenance-tracked operators. Every transform, parameter, operator order, and manual action is logged via a machine-readable provenance graph, with hooks (manual_labeling_hook, label_export_hook, data_export_hook) providing traceable and auditable execution flows.
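The effect of operator order can be illustrated with a minimal sketch. The operator implementations, the `OPERATORS` registry, and `run_pipeline` below are hypothetical stand-ins written for this example, not dFL's actual API; they only show why $N \circ S$ and $S \circ N$ can disagree and how an order index might be logged alongside each step.

```python
import numpy as np

def smooth(x, width=3):
    """Moving average via 'same'-mode convolution (zero-padded at the edges)."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def normalize(x):
    """Min-max normalization to the interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical operator registry keyed by the single-letter codes used in the text.
OPERATORS = {"S": smooth, "N": normalize}

def run_pipeline(x, order):
    """Apply operators in the given order and log each step with its order index."""
    log = []
    for index, name in enumerate(order):
        x = OPERATORS[name](x)
        log.append({"order_index": index, "operator": name})
    return x, log

x = np.array([0.0, 1.0, 0.0, 4.0, 0.0, 1.0, 0.0])
sn, log_sn = run_pipeline(x, ["S", "N"])  # smooth first, then normalize
ns, log_ns = run_pipeline(x, ["N", "S"])  # normalize first, then smooth

# The two orders give different signals: normalizing last pins the maximum at 1,
# while smoothing last pulls the maximum below 1.
differs = not np.allclose(sn, ns)
print(differs, sn.max(), ns.max())
```

Min-max normalization depends on the extrema of whatever signal it receives, so smoothing before or after it changes the statistics it uses. This is why the order has to be stored with the result rather than assumed.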

2. Uncertainty-Aware Data Harmonization

dFL employs comprehensive uncertainty management at each harmonization stage:

Measurement-Error Propagation: Given diagnostics $y_i(t_i) = H_i x(t) + \epsilon_i$ , with $\mathrm{Var}[\epsilon_i] = \sigma_i^2$ , signal fusion propagates uncertainties according to

$\mathrm{Var}[z] = \sum_i w_i^2 \sigma_i^2 + 2 \sum_{i<j} w_i w_j \mathrm{Cov}(\epsilon_i, \epsilon_j)$

for $z = \sum_i w_i y_i$ .
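As a worked check of the propagation formula, the sketch below evaluates $\mathrm{Var}[z]$ as the quadratic form $w^\top \Sigma w$, which expands to exactly the two sums above. The function name `fused_variance` and the example numbers are illustrative assumptions.

```python
import numpy as np

def fused_variance(weights, sigmas, corr=None):
    """Variance of z = sum_i w_i y_i, given per-channel sigmas and an optional
    correlation matrix between the measurement errors."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray(sigmas, dtype=float)
    if corr is None:
        corr = np.eye(len(s))  # independent errors: covariance terms vanish
    cov = np.outer(s, s) * np.asarray(corr, dtype=float)  # Cov(eps_i, eps_j)
    return float(w @ cov @ w)

w = [0.5, 0.5]
sigmas = [1.0, 2.0]

# Independent errors: 0.25 * 1 + 0.25 * 4 = 1.25
var_indep = fused_variance(w, sigmas)

# Correlation 0.5 adds 2 * w1 * w2 * Cov = 2 * 0.25 * (0.5 * 1 * 2) = 0.5
var_corr = fused_variance(w, sigmas, corr=[[1.0, 0.5], [0.5, 1.0]])
print(var_indep, var_corr)  # 1.25 1.75
```

Ignoring a positive correlation between diagnostics understates the fused uncertainty, which is the reason the cross term is carried explicitly.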

Alignment & Interpolation: Each signal's time grid $t_i$ is remapped to a common grid $t$ via basis interpolation (linear, cubic, piecewise cubic Hermite (PCHIP), or Akima). The operator

$x_i^*(t) = \sum_k \varphi_k(t) x_i(t_{i,k})$

achieves temporal harmonization; an analogous projection works for spatial harmonization (e.g., mapping to flux coordinates $\psi_N$ ).
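For the linear case, the basis functions $\varphi_k$ are the familiar "hat" functions, and the sum over $k$ reduces to ordinary linear interpolation. The sketch below builds the hat basis explicitly to show that correspondence; `hat_basis` and `remap` are names invented for this example, and the sample values are arbitrary.

```python
import numpy as np

def hat_basis(t, knots, k):
    """Piecewise-linear basis phi_k(t): equals 1 at knots[k] and 0 at every other knot."""
    unit = np.zeros(len(knots))
    unit[k] = 1.0
    return np.interp(t, knots, unit)

def remap(t_common, t_i, x_i):
    """Evaluate x_i*(t) = sum_k phi_k(t) x_i(t_{i,k}) on the common grid."""
    return sum(hat_basis(t_common, t_i, k) * x_i[k] for k in range(len(t_i)))

t_i = np.array([0.0, 1.0, 3.0, 4.0])   # irregular native time grid
x_i = np.array([0.0, 2.0, 6.0, 5.0])   # samples on that grid
t_common = np.linspace(0.0, 4.0, 9)    # common grid with 0.5 spacing

x_star = remap(t_common, t_i, x_i)

# The explicit basis sum matches NumPy's built-in linear interpolation.
matches = np.allclose(x_star, np.interp(t_common, t_i, x_i))
print(matches, x_star[4])  # value at t = 2.0 lies halfway between 2.0 and 6.0
```

Higher-order schemes such as PCHIP or Akima keep the same structure but use different basis functions.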

Resampling & Smoothing: dFL supports five up-sampling methods (linear, spline, cubic, mono-PCHIP, Akima) and five down-sampling methods (including block-mean, block-extrema, importance-sampling, and GMM-MLE), with smoothing via moving-average, EMA, Savitzky–Golay, Gaussian, and Butterworth filters. Each filter's effective transfer function is recorded for correct downstream uncertainty/fusion weighting.
Normalization: Supports standard score, min–max, robust (median/IQR), and shifted Box–Cox ( $x' = ((x+\delta)^\lambda - 1)/\lambda$ for $\lambda \ne 0$ , $\ln(x+\delta)$ otherwise), all invertible, with parameters stored in the provenance graph for precise de-normalization.

Uncertainty masks and samplewise variances are propagated throughout, enabling rigorous quantification of downstream analytic confidence.
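The invertibility claim for the shifted Box–Cox normalization can be checked directly. The sketch below implements the forward transform as stated above and its algebraic inverse; the function names and the `params` dictionary (standing in for values stored in the provenance graph) are assumptions for illustration.

```python
import numpy as np

def boxcox_shifted(x, lam, delta):
    """Shifted Box-Cox: ((x + delta)^lam - 1) / lam, or ln(x + delta) when lam == 0.
    Requires x + delta > 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x + delta)
    return ((x + delta) ** lam - 1.0) / lam

def boxcox_shifted_inverse(y, lam, delta):
    """Inverse transform, used for de-normalization from stored parameters."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.exp(y) - delta
    return (lam * y + 1.0) ** (1.0 / lam) - delta

x = np.array([-0.5, 0.0, 1.0, 10.0])
params = {"lambda": 0.5, "delta": 1.0}  # hypothetical stored parameters

y = boxcox_shifted(x, params["lambda"], params["delta"])
x_back = boxcox_shifted_inverse(y, params["lambda"], params["delta"])
roundtrip_ok = np.allclose(x, x_back)

# The lambda = 0 branch round-trips as well.
y0 = boxcox_shifted(x, 0, params["delta"])
roundtrip_log_ok = np.allclose(x, boxcox_shifted_inverse(y0, 0, params["delta"]))
print(roundtrip_ok, roundtrip_log_ok)
```

Recovering the original values requires exactly the $\lambda$ and $\delta$ used in the forward pass, which is why they need to be stored with the transformed data.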

3. Schema-Compliant Data Fusion

dFL reads and emits IMAS/OMAS-compliant objects, ensuring cross-device and cross-software interoperability.

Signal-Level Fusion: Data streams may be stacked, combined via inverse-variance weighted means,

$x_f(t) = \frac{\sum_i \sigma_i^{-2} x_i(t)}{\sum_j \sigma_j^{-2}}, \quad \mathrm{Var}[x_f] = 1 / \sum_i \sigma_i^{-2}$

or updated sequentially in a Kalman filter style.
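A small sketch makes the relationship between the two options concrete: for a static scalar quantity with independent errors, a sequence of Kalman-style measurement updates (no process noise, which is an assumption of this example) reproduces the batch inverse-variance mean and its variance. Function names and numbers are illustrative.

```python
import numpy as np

def inverse_variance_fuse(values, sigmas):
    """Batch inverse-variance weighted mean and its variance."""
    values = np.asarray(values, dtype=float)
    weights = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    var = 1.0 / weights.sum()
    return float(var * (weights * values).sum()), float(var)

def sequential_fuse(values, sigmas):
    """Scalar Kalman-style measurement updates for a static state."""
    est, var = float(values[0]), float(sigmas[0]) ** 2
    for v, s in zip(values[1:], sigmas[1:]):
        gain = var / (var + s ** 2)      # weight given to the new measurement
        est = est + gain * (v - est)     # corrected estimate
        var = (1.0 - gain) * var         # reduced variance
    return est, var

values = [1.0, 2.0, 4.0]   # three measurements of the same quantity
sigmas = [1.0, 2.0, 4.0]   # their standard deviations

batch = inverse_variance_fuse(values, sigmas)
seq = sequential_fuse(values, sigmas)
print(batch, seq)  # both approximately (1.3333, 0.7619)
```

The sequential form is convenient when diagnostics arrive one at a time, since it never needs the full set of measurements in memory.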

Feature-Level Extraction: Derived features (e.g., radial mode amplitudes, energy content) are computed via programmable feature-maps, with tracked parameter and operator provenance.
Custom Fusion: Dynamically injected Python modules permit user- or project-specific preprocessing or feature engineering, with all custom logic fully parameter-tracked.

This approach enables harmonization and fusion across highly heterogeneous diagnostics, and seamless export of results to major community data schemas.

4. Labeling Interface and Provenance Infrastructure

Each dFL workflow action (data import, harmonization, transform, manual label, or autolabeling) is registered in a machine-actionable provenance architecture comprising:

Data Source Records (shot, diagnostic, software version)
Transform Serializations (operator, parameters, order index)
Label Events (label name, start/end, annotator, timestamp)
Autolabeler Metadata (algorithm, hyperparameters, confidence distributions)

Manual, semi-automated, and batch labeling are all supported. GUI hooks allow annotators to link notes, trigger feature extractions, and flag ambiguous cases, while exports to the Core Metadata Facility (CMF) or local file systems record every transformation and labeling event. The result is a fully auditable, reproducible, and shareable annotation provenance suitable for cross-institutional workflows.
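One way to picture the record types listed above is as plain data classes serialized to JSON. The class names, field types, and sample values below are assumptions made for this sketch (only the field lists come from the text); dFL's actual schema may differ.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DataSourceRecord:
    shot: int
    diagnostic: str
    software_version: str

@dataclass
class TransformRecord:
    operator: str
    parameters: dict
    order_index: int

@dataclass
class LabelEvent:
    label_name: str
    start: float
    end: float
    annotator: str
    timestamp: str

@dataclass
class ProvenanceBundle:
    """Hypothetical container tying a data source to its transforms and labels."""
    source: DataSourceRecord
    transforms: list = field(default_factory=list)
    labels: list = field(default_factory=list)

    def to_json(self):
        # asdict recurses into nested dataclasses, including those inside lists.
        return json.dumps(asdict(self), sort_keys=True)

bundle = ProvenanceBundle(
    source=DataSourceRecord(shot=149092, diagnostic="filterscope", software_version="0.0.0"),
)
bundle.transforms.append(TransformRecord("S", {"method": "moving_average", "width": 5}, 0))
bundle.transforms.append(TransformRecord("N", {"method": "min_max"}, 1))
bundle.labels.append(LabelEvent("ELM", 2.1000, 2.1004, "annotator_a", "2025-11-12T00:00:00Z"))

restored = json.loads(bundle.to_json())
print(restored["transforms"][1]["operator"])
```

Because every record is plain data, the exported JSON can be replayed: applying the logged operators in `order_index` order with the logged parameters should regenerate the labeled signal.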

5. Performance, Accuracy, and Workflow Efficiency

dFL replaces fragmented, ad-hoc labeling and harmonization pipelines with a scalable, operator-aware system:

Throughput	Workflow	Speed-up
~5 shots/day	Manual ELM per shot	1×
~200 shots/hour	dFL z-score ELM autolabel	~250×
~0.1 shots/hour	Expert QH-mode labeling	1×
~50 shots/hour	dFL XGBoost QH-mode	~500×

Performance metrics:

Automated ELM detection (z-score, causal 1 ms window, threshold |z(t)| > 3, i.e., 3σ): F1 = 0.98, precision/recall 0.97–0.99, all 34 events detected (Shot 149092), throughput ~300 shots/hour.
Multi-class plasma regime classification (QH/BBQH/WPQH): test accuracy >90% (XGBoost, ONNX export, real-time deployment).
Labeling throughput improved by >50×, enabling 800k–1M records annotated in minutes to hours, compared to days for manual workflows.

All improvements are with respect to ad-hoc/manual baselines (Michoski et al., 12 Nov 2025).
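The reported F1 score is consistent with the reported precision/recall range, as a quick calculation shows. Taking the two endpoints of the range as the precision and recall is an assumption made only for this check; the source does not say which value belongs to which metric.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Endpoints of the reported 0.97-0.99 range (assignment to P and R is assumed;
# F1 is symmetric, so swapping them gives the same result).
f1 = f1_score(0.97, 0.99)
print(round(f1, 2))  # 0.98
```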

6. Case Studies: Application to Fusion Energy Workflows

a) Automated ELM Detection

Input: DIII-D filterscope Dα at 100 kHz.
Harmonization: Min–Max normalization, gap-fill, per-window z-score.
Event Detection: |z(t)| > 3 (a 3σ threshold) triggers ELM onset/offset tags.
Results: All ELMs in test shot (34/34) detected, F1=0.98, batch labeling >200 shots/hour.
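The detection rule can be sketched on synthetic data. The example below assumes that "causal" means each sample is scored against the mean and standard deviation of the preceding 1 ms of data (100 samples at 100 kHz); the synthetic signal, the function names, and the burst shape are all invented for illustration and are not DIII-D data or dFL code.

```python
import numpy as np

def causal_zscore(x, window):
    """z-score of each sample against the mean/std of the preceding `window` samples."""
    x = np.asarray(x, dtype=float)
    z = np.zeros_like(x)
    for n in range(window, len(x)):
        past = x[n - window:n]        # strictly earlier samples only (causal)
        sd = past.std()
        if sd > 0:
            z[n] = (x[n] - past.mean()) / sd
    return z

def detect_events(z, threshold=3.0):
    """Return (onset, offset) index pairs for contiguous runs with |z| > threshold.
    The offset is exclusive."""
    above = np.abs(z) > threshold
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(int))
    onsets = np.flatnonzero(edges == 1)
    offsets = np.flatnonzero(edges == -1)
    return list(zip(onsets.tolist(), offsets.tolist()))

fs = 100_000                 # sampling rate in Hz (from the case study)
window = int(1e-3 * fs)      # 1 ms causal window -> 100 samples

# Synthetic stand-in signal: a small oscillating baseline plus three short bursts.
n = np.arange(2000)
signal = np.sin(2 * np.pi * n / 50)
true_onsets = [400, 900, 1500]
for k in true_onsets:
    signal[k:k + 5] += 10.0

events = detect_events(causal_zscore(signal, window), threshold=3.0)
detected_onsets = [onset for onset, _ in events]
print(events)
```

On this toy signal each burst produces one onset/offset pair. Real filterscope data is noisier, so a practical detector would likely add debouncing or a minimum event duration, which this sketch omits.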

b) Confinement Regime Classification

Label domains: QH-mode, Broadband QH-mode, Wide-pedestal QH-mode.
Features: Current, power, $\beta_N$, pedestal width, mode amplitudes, densities, torque.
Model: XGBoost, group K-fold, class-weighted, early stopping.
Dataset: 360 shots, 792,000 labeled samples, extended to ~1,000,000 samples.
Performance: >90% accuracy on held-out tests, ONNX export for real-time PCS sidecar inference.
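Group K-fold matters here because samples from the same shot are strongly correlated; if one shot contributes to both training and test sets, held-out accuracy is inflated. The sketch below is a hand-rolled, NumPy-only splitter that keeps each shot in a single fold. It illustrates the idea only, is not the splitter used in the study, and the shot numbers are made up.

```python
import numpy as np

def group_kfold_indices(groups, n_splits):
    """Yield (train_idx, test_idx) pairs such that no group spans both sets."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    # Assign whole groups to folds round-robin.
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique)}
    fold = np.array([fold_of_group[g] for g in groups])
    for k in range(n_splits):
        yield np.flatnonzero(fold != k), np.flatnonzero(fold == k)

# Hypothetical data: 6 shots with 4 samples each.
shots = np.repeat([101, 102, 103, 104, 105, 106], 4)
splits = list(group_kfold_indices(shots, n_splits=3))

# No shot appears on both sides of any split.
leak_free = all(
    set(shots[train]).isdisjoint(set(shots[test])) for train, test in splits
)
print(leak_free, [len(test) for _, test in splits])
```

Each fold's test indices can then be passed to whatever classifier is being evaluated, with class weights and early stopping configured on the training portion only.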

These cases demonstrate not only batch throughput, but support for robust, real-time deployment in control systems.

7. Integration, Generalization, and Future Directions

Real-Time PCS Interfaces: Lightweight z-score/slope-based detectors and mode-posterior classifiers feed plasma actuators or constrained MPC in real time.
Multi-Device & Synthetic Diagnostic Fusion: Harmonization across DIII-D, NSTX-U, and JET, as well as synthetic outputs (GENE, M3D-C1), enables development of unified predictors and disruption forecasters.
Cross-Domain Extension: The operator-aware, uncertainty-tracked, provenance-capturing dFL framework is directly applicable to multimodal time series in BCI, structural health monitoring, and IoT settings.

By converting non-reproducible, fragile data labeling into a scalable, transparent, and auditable workflow, dFL supports large-scale, high-integrity ML pipelines, cross-device learning, and rapid, high-confidence discovery in complex scientific domains (Michoski et al., 12 Nov 2025).


References

Michoski et al. (12 Nov 2025). The Data Fusion Labeler (dFL): Challenges and Solutions to Data Harmonization, Labeling, and Provenance in Fusion Energy.
