Data Fusion Labeler (dFL) Workflow
- Data Fusion Labeler (dFL) is a comprehensive framework that fuses heterogeneous data using uncertainty-aware harmonization and provenance-rich labeling.
- It integrates data ingestion, modular operator-order processing, and both signal- and feature-level fusion to support traditional and machine learning pipelines.
- The system enhances throughput and accuracy in complex domains by ensuring real-time, reproducible, and schema-compliant data processing.
The Data Fusion Labeler (dFL) refers to a set of methodologies and toolchains for fusing, harmonizing, and labeling heterogeneous data, with particular impact in high-complexity domains such as fusion energy research. Across its variants, dFL tightly integrates uncertainty-aware harmonization, schema-compliant fusion, and provenance-rich labeling. It acts as a unified workflow instrument, supporting both traditional and machine learning pipelines by embedding operator-order-awareness, normalization, alignment, and complete reproducibility, resulting in substantial improvements in analysis throughput, label quality, and cross-device/model generality (Michoski et al., 12 Nov 2025).
1. System Architecture and Operator-Order Provenance
The dFL pipeline is modular, typically organized in four core stages:
- Data Ingestion utilizes a user-supplied data_provider (often leveraging TokSearch or IMAS/OMAS accessors).
- Data Harmonization executes a user-controlled pipeline of alignment (trim, T), gap-fill (F), resampling (R), smoothing (S), and normalization (N). Because these operators do not generally commute, they are executed in a reproducibility-guaranteed, operator-order-aware fashion, and a default order is provided.
- Data Fusion/Feature Extraction applies signal-level or feature-level methods, such as inverse-variance/stochastic stacking or complex feature computation via programmable Python modules.
- Labeling & Export provides GUI-driven manual and scalable automated labeling (“autolabelers”), exporting annotations and signals with all provenance metadata to institutional or open repositories.
All functions are first-class, provenance-tracked operators. Every transform, parameter, operator order, and manual action is logged via a machine-readable provenance graph, with hooks (manual_labeling_hook, label_export_hook, data_export_hook) providing traceable and auditable execution flows.
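To make the operator-order point concrete, the following is a minimal sketch, not dFL's actual API. The names `run_pipeline`, `moving_average`, `decimate`, and the log layout are hypothetical. It applies a smoothing step (S) and a crude resampling step (R) in both orders, records each step with its parameters and order index, and shows that the two orders give different signals.

```python
import json
from typing import Callable, Dict, List, Tuple

Signal = List[float]

def moving_average(x: Signal, window: int) -> Signal:
    """Causal moving average over up to `window` trailing samples."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - window + 1)
        out.append(sum(x[lo:i + 1]) / (i + 1 - lo))
    return out

def decimate(x: Signal, step: int) -> Signal:
    """Keep every `step`-th sample (a crude stand-in for resampling)."""
    return x[::step]

# Hypothetical operator registry keyed by the single-letter codes used in the text.
OPERATORS: Dict[str, Callable[..., Signal]] = {"S": moving_average, "R": decimate}

def run_pipeline(x: Signal, steps: List[Tuple[str, dict]]) -> Tuple[Signal, List[dict]]:
    """Apply operators in the given order and log each step for provenance."""
    provenance = []
    for index, (name, params) in enumerate(steps):
        x = OPERATORS[name](x, **params)
        provenance.append({"order_index": index, "operator": name, "params": params})
    return x, provenance

raw = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 8.0, 0.0]
smooth_then_resample, log_sr = run_pipeline(raw, [("S", {"window": 2}), ("R", {"step": 2})])
resample_then_smooth, log_rs = run_pipeline(raw, [("R", {"step": 2}), ("S", {"window": 2})])

print(smooth_then_resample)
print(resample_then_smooth)
print(json.dumps(log_sr))
```

Since the two results differ, a log that stores only the set of operators and not their order would not be enough to reproduce the output.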
2. Uncertainty-Aware Data Harmonization
dFL employs comprehensive uncertainty management at each harmonization stage:
- Measurement-Error Propagation: Each diagnostic signal carries its own measurement uncertainty, and signal fusion propagates these per-diagnostic uncertainties into the fused quantity.
- Alignment & Interpolation: Each signal's time grid is remapped to a common grid via basis interpolation (linear, cubic, piecewise cubic Hermite (PCHIP), Akima). This interpolation operator achieves temporal harmonization; an analogous projection works for spatial harmonization (e.g., mapping to flux coordinates).
- Resampling & Smoothing: dFL supports five up-sampling methods (linear, spline, cubic, mono-PCHIP, Akima) and five down-sampling methods (including block-mean, block-extrema, importance-sampling, and GMM-MLE), with smoothing via moving-average, EMA, Savitzky–Golay, Gaussian, and Butterworth kernels. Each kernel's effective transfer function is recorded for correct downstream uncertainty/fusion weighting.
- Normalization: Supports standard score, min–max, robust (median/IQR), and shifted Box–Cox with shift $c$ ($((x + c)^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$, $\ln(x + c)$ otherwise), all invertible, with parameters stored in the provenance graph for precise de-normalization.
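The invertibility claim for normalization can be illustrated with a short sketch of the standard shifted Box–Cox transform. The function names and the `params` layout are hypothetical and not taken from dFL. The point is that storing `lam` and `shift` with the data is enough to recover the original values.

```python
import math

def boxcox_forward(x: float, lam: float, shift: float) -> float:
    """Shifted Box-Cox transform; requires x + shift > 0."""
    y = x + shift
    if y <= 0:
        raise ValueError("x + shift must be positive")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def boxcox_inverse(z: float, lam: float, shift: float) -> float:
    """Invert boxcox_forward using the same stored parameters."""
    y = math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)
    return y - shift

# Parameters as they might be stored alongside the transformed data (hypothetical layout).
params = {"operator": "N", "method": "boxcox", "lam": 0.5, "shift": 2.0}

signal = [-1.0, 0.0, 3.5]
normalized = [boxcox_forward(x, params["lam"], params["shift"]) for x in signal]
restored = [boxcox_inverse(z, params["lam"], params["shift"]) for z in normalized]
print(restored)
```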
Uncertainty masks and samplewise variances are propagated throughout, enabling rigorous quantification of downstream analytic confidence.
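As a point of reference for the uncertainty propagation described in this section, the standard first-order rule for independent errors is written out below. The symbols $x_i$, $\sigma_i$, $f$, and $w_i$ are notation chosen here for illustration, and this textbook form is an assumption rather than a statement of the exact expression dFL uses.

$$
\sigma_y^2 \approx \sum_{i=1}^{N} \left(\frac{\partial f}{\partial x_i}\right)^2 \sigma_i^2, \qquad y = f(x_1, \dots, x_N).
$$

For a linear combination $y = \sum_i w_i x_i$ the relation is exact: $\sigma_y^2 = \sum_i w_i^2 \sigma_i^2$.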
3. Schema-Compliant Data Fusion
dFL reads and emits IMAS/OMAS-compliant objects, ensuring cross-device and cross-software interoperability.
- Signal-Level Fusion: Data streams may be stacked, combined via inverse-variance weighted means, $\hat{x} = \sum_i (x_i/\sigma_i^2) \big/ \sum_i (1/\sigma_i^2)$, or updated sequentially in a Kalman filter style.
- Feature-Level Extraction: Derived features (e.g., radial mode amplitudes, energy content) are computed via programmable feature-maps, with tracked parameter and operator provenance.
- Custom Fusion: Dynamically injected Python modules permit user- or project-specific preprocessing or feature engineering, with all custom logic fully parameter-tracked.
This approach enables harmonization and fusion across highly heterogeneous diagnostics, and seamless export of results to major community data schemas.
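The two signal-level options named above, the inverse-variance weighted mean and a sequential Kalman-style update, can be sketched for scalar measurements as follows. This is an illustrative sketch assuming independent Gaussian errors. The function names are hypothetical and the code is not taken from dFL.

```python
import math
from typing import List, Tuple

def inverse_variance_fuse(values: List[float], variances: List[float]) -> Tuple[float, float]:
    """Fuse independent measurements; returns (fused mean, fused variance)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, 1.0 / total

def sequential_update(mean: float, var: float, new_value: float, new_var: float) -> Tuple[float, float]:
    """Scalar Kalman-style measurement update of a running estimate."""
    gain = var / (var + new_var)
    return mean + gain * (new_value - mean), (1.0 - gain) * var

values = [10.0, 12.0]
variances = [1.0, 4.0]

batch_mean, batch_var = inverse_variance_fuse(values, variances)
seq_mean, seq_var = sequential_update(values[0], variances[0], values[1], variances[1])
print(batch_mean, batch_var)
print(seq_mean, seq_var)
```

For this static case the batch and sequential forms give the same estimate, so the sequential form mainly matters when measurements arrive one at a time.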
4. Labeling Interface and Provenance Infrastructure
Each dFL workflow action, from data import through harmonization, transformation, manual labeling, and autolabeling, is registered in a machine-actionable provenance architecture comprising:
- Data Source Records (shot, diagnostic, software version)
- Transform Serializations (operator, parameters, order index)
- Label Events (label name, start/end, annotator, timestamp)
- Autolabeler Metadata (algorithm, hyperparameters, confidence distributions)
Manual, semi-automated, and batch labeling are all supported. GUI hooks allow annotators to link notes, trigger feature extractions, and designate ambiguous cases, while exports to the Core Metadata Facility (CMF) or local file systems record every transformation and labeling event. The result is a fully auditable, reproducible, and shareable annotation provenance suitable for cross-institutional workflows.
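One way to picture the record types listed above is as plain data classes serialized to JSON. The sketch below is an assumption about shape only. The class names, field types, and example values (other than the field names from the list and shot number 149092, which appear in this article) are hypothetical and do not reflect dFL's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class DataSource:
    shot: int
    diagnostic: str
    software_version: str

@dataclass
class Transform:
    operator: str
    parameters: Dict[str, float]
    order_index: int

@dataclass
class LabelEvent:
    label_name: str
    start: float
    end: float
    annotator: str
    timestamp: str

@dataclass
class ProvenanceRecord:
    source: DataSource
    transforms: List[Transform] = field(default_factory=list)
    labels: List[LabelEvent] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole record, including nested entries, to JSON."""
        return json.dumps(asdict(self), sort_keys=True)

record = ProvenanceRecord(
    source=DataSource(shot=149092, diagnostic="example_diagnostic", software_version="0.0.0"),
    transforms=[Transform(operator="N", parameters={"low": 0.0, "high": 1.0}, order_index=0)],
    labels=[LabelEvent("ELM", 2.001, 2.003, "annotator_a", "2025-01-01T00:00:00Z")],
)
serialized = record.to_json()
print(serialized)
```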
5. Performance, Accuracy, and Workflow Efficiency
dFL replaces fragmented, ad-hoc labeling and harmonization pipelines with a scalable, operator-aware system:
| Workflow | Throughput | Speed-up |
|---|---|---|
| Manual ELM per shot | ~5 shots/day | 1× |
| dFL z-score ELM autolabel | ~200 shots/hour | ~250× |
| Expert QH-mode labeling | ~0.1 shots/hour | 1× |
| dFL XGBoost QH-mode | ~50 shots/hour | ~500× |
Performance metrics:
- Automated ELM detection (z-score, causal 1 ms window, |z(t)| > 3σ): F1 = 0.98, precision/recall 0.97–0.99, all 34 events detected (Shot 149092), throughput ~300 shots/hour.
- Multi-class plasma regime classification (QH/BBQH/WPQH): test accuracy >90% (XGBoost, ONNX export, real-time deployment).
- Labeling throughput improved by >50×, enabling 800k–1M records annotated in minutes to hours, compared to days for manual workflows.
All improvements are with respect to ad-hoc/manual baselines (Michoski et al., 12 Nov 2025).
6. Case Studies: Application to Fusion Energy Workflows
a) Automated ELM Detection
- Input: DIII-D filterscope Dα at 100 kHz.
- Harmonization: Min–Max normalization, gap-fill, per-window z-score.
- Event Detection: |z(t)| > 3σ triggers ELM onset/offset tags.
- Results: All ELMs in test shot (34/34) detected, F1=0.98, batch labeling >200 shots/hour.
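The detection step in this case study can be sketched as a causal rolling z-score followed by thresholding. The code below is an illustrative reading of that description, not the dFL autolabeler. The function names are hypothetical, the signal is synthetic, and the window is 10 samples for brevity (at 100 kHz, a 1 ms window would be 100 samples). The min–max normalization and gap-fill steps are omitted.

```python
import math
from typing import List, Tuple

def causal_zscore(signal: List[float], window: int) -> List[float]:
    """z-score of each sample against the mean/std of the preceding `window` samples."""
    z = [0.0] * len(signal)
    for i in range(window, len(signal)):
        past = signal[i - window:i]
        mean = sum(past) / window
        std = math.sqrt(sum((p - mean) ** 2 for p in past) / window)
        z[i] = (signal[i] - mean) / std if std > 0 else 0.0
    return z

def detect_events(z: List[float], threshold: float = 3.0) -> List[Tuple[int, int]]:
    """Return (onset, offset) index pairs for runs where |z| exceeds the threshold."""
    events = []
    start = None
    for i, value in enumerate(z):
        above = abs(value) > threshold
        if above and start is None:
            start = i
        elif not above and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(z)))
    return events

# Synthetic baseline alternating between 1.0 and 1.2, with two single-sample bursts.
trace = [1.0 if i % 2 == 0 else 1.2 for i in range(100)]
trace[30] = 10.0
trace[70] = 10.0

events = detect_events(causal_zscore(trace, window=10))
print(events)
```

Because the window only looks backward, the same logic can run sample by sample on a live stream.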
b) Confinement Regime Classification
- Label domains: QH-mode, Broadband QH-mode, Wide-pedestal QH-mode.
- Features: Current, power, , pedestal width, mode amplitudes, densities, torque.
- Model: XGBoost, group K-fold, class-weighted, early stopping.
- Dataset: 360 shots, 792,000 labeled samples, extended to ~1,000,000 samples.
- Performance: >90% accuracy on held-out tests, ONNX export for real-time PCS sidecar inference.
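Group K-fold matters here because samples from the same shot are strongly correlated, so a shot should not appear in both the training and the test fold. The sketch below shows the splitting idea only, using a simple round-robin assignment of shots to folds. It is a simplified stand-in written for illustration (library implementations such as scikit-learn's `GroupKFold` also balance fold sizes), and the shot numbers are made up.

```python
from typing import Iterator, List, Tuple

def group_k_fold(groups: List[int], n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs such that no group is on both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Round-robin assignment of each group (shot) to exactly one test fold.
    fold_of = {g: k % n_splits for k, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test

# One entry per labeled sample; the value is the (made-up) shot it came from.
shots = [101, 101, 102, 102, 103, 103, 104]
splits = list(group_k_fold(shots, n_splits=2))

for train_idx, test_idx in splits:
    print(sorted({shots[i] for i in train_idx}), sorted({shots[i] for i in test_idx}))
```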
These cases demonstrate not only batch throughput, but support for robust, real-time deployment in control systems.
7. Integration, Generalization, and Future Directions
- Real-Time PCS Interfaces: Lightweight z-score/slope-based detectors and mode-posterior classifiers feed plasma actuators or constrained MPC in real time.
- Multi-Device & Synthetic Diagnostic Fusion: Harmonization across DIII-D, NSTX-U, JET, as well as synthetic outputs (GENE, M3D-C1), enables development of unified predictors and disruption forecasters.
- Cross-Domain Extension: The operator-aware, uncertainty-tracked, provenance-capturing dFL framework is directly applicable to multimodal time series in BCI, structural health monitoring, and IoT settings.
By converting non-reproducible, fragile data labeling into a scalable, transparent, and auditable workflow, dFL supports large-scale, high-integrity ML pipelines, cross-device learning, and rapid, high-confidence discovery in complex scientific domains (Michoski et al., 12 Nov 2025).