
Multimodal-RUL Prognostics Framework

Updated 14 December 2025
  • The multimodal-RUL framework is an integrated prognostics system that leverages heterogeneous sensor modalities, including vibration and image data, for precise RUL estimation.
  • It employs parallel CNN and Conv1D branches with LSTM-based temporal fusion and multi-head self-attention to capture spatial and temporal degradation signatures.
  • Additional physics-informed regularization and layer-wise relevance propagation enhance robustness, explainability, and performance under noisy and incomplete data conditions.

A multimodal-RUL (Remaining Useful Life) framework refers to an integrated prognostics system that simultaneously leverages heterogeneous data modalities—such as raw vibration signals, image representations, time-frequency representations, and multi-channel sensor data—to estimate the time until failure (RUL) of complex machinery. State-of-the-art multimodal-RUL frameworks combine advanced neural architectures, principled preprocessing pipelines, and explainability techniques to robustly and transparently predict RUL under practical constraints (limited, noisy, or missing data). Recent frameworks also incorporate physics-informed modeling, information-theoretic redundancy management, and layer-wise explanation mechanisms to enhance robustness, interpretability, and generalization (Razzaq et al., 7 Dec 2025, Nagaraj et al., 2024, Nguyen et al., 3 Sep 2025).

1. Multimodal Data Preprocessing and Representation

Multi-sensor environments in PHM (Prognostics and Health Management) generate high-dimensional, multi-source data (e.g., vibration, temperature, acoustic, image). Effective multimodal-RUL frameworks first preprocess such data to facilitate consistent downstream modeling:

  • Windowing and Normalization: Multichannel time-series signals are segmented into overlapping windows (e.g., of length L_w = 1000) and min–max normalized to [0, 1] across each channel.
  • Image Representation (ImR): Each normalized window is transformed into a 2D raster using the Bresenham line algorithm, generating an image trace of size 64×1000 that encodes temporal signal geometry (Razzaq et al., 7 Dec 2025).
  • Time–Frequency Representation (TFR): Continuous Wavelet Transform (CWT, using Morlet wavelets) is applied to each windowed segment, producing rich, nonstationary TFR maps. Summary features (energy, dominant frequency, entropy, kurtosis, skewness, mean, std. dev.) are extracted from the CWT coefficients.
  • Other Modalities: Frameworks can generalize to further modality-specific encoders, including multivariate sensor time-series or other derived signal representations (Nagaraj et al., 2024, Nguyen et al., 3 Sep 2025).

The preprocessing stage ensures that each modality is mapped into a domain-specific feature space optimized for subsequent neural feature extraction and fusion.
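
As a concrete illustration of part of this pipeline, the sketch below segments one vibration channel into overlapping, normalized windows and reduces each window's CWT to the summary statistics listed above. The window length of 1000 follows the text; the 50% overlap step, the 64 CWT scales, and the PyWavelets "morl" wavelet are assumptions made for this sketch, and the Bresenham ImR rasterization step is omitted for brevity.

```python
import numpy as np
import pywt                                   # PyWavelets, for the Continuous Wavelet Transform
from scipy.stats import kurtosis, skew

def windows(signal, length=1000, step=500):
    """Segment one channel into overlapping windows, each min-max scaled to [0, 1]."""
    for start in range(0, len(signal) - length + 1, step):
        w = signal[start:start + length].astype(float)
        yield (w - w.min()) / (w.max() - w.min() + 1e-12)

def cwt_summary_features(window, scales=np.arange(1, 65), wavelet="morl"):
    """Morlet CWT of one window, reduced to the summary statistics listed above."""
    coeffs, freqs = pywt.cwt(window, scales, wavelet)
    power = np.abs(coeffs) ** 2
    p = power.sum(axis=1) / (power.sum() + 1e-12)          # energy distribution over scales
    return np.array([
        power.sum(),                                       # energy
        freqs[np.argmax(power.sum(axis=1))],               # dominant (pseudo-)frequency
        -np.sum(p * np.log(p + 1e-12)),                    # scale-wise entropy
        kurtosis(coeffs.ravel()),                          # kurtosis
        skew(coeffs.ravel()),                              # skewness
        coeffs.mean(),                                     # mean
        coeffs.std(),                                      # standard deviation
    ])

# Example: TFR summary features for every window of a synthetic channel
features = np.array([cwt_summary_features(w) for w in windows(np.random.randn(10_000))])
```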

2. Multimodal Neural Architecture Design

The canonical multimodal-RUL architecture comprises parallel modality-specific feature extractors, a fusion module, and a temporal modeling block:

  • Modality Branches:
    • ImR branch: Deep CNN with multiple dilated convolutional blocks and residual connections extracts degradation features from rasterized vibration images.
    • TFR branch: Hierarchical Conv1D blocks (with dilation and residual paths) distill temporal degradation features from TFR feature sequences.
  • Fusion Mechanism:
    • Outputs from all branches are flattened and concatenated, yielding a unified multimodal feature tensor.
    • A stack of LSTM layers (with residual shortcuts) models time dependencies and evolving degradation patterns across the concatenated multimodal representation.
    • Multi-Head Self-Attention (H=8, head dim=64) further emphasizes salient temporal–modal attributes before final regression layers.
  • Output Head:
    • Dense linear layers with ReLU nonlinearity map the global fused feature vector to a scalar RUL estimate.
  • Regularization:
    • Weight decay is applied to all convolutional and fully-connected layers (e.g., λ = 0.01) (Razzaq et al., 7 Dec 2025).

This architecture enables discriminative and complementary information from each modality to be jointly leveraged, capturing both spatial and temporal signatures critical for accurate RUL prediction.
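
A minimal PyTorch sketch of this parallel-branch design follows. Channel counts, pooling sizes, the two-layer LSTM, and the 512-dimensional fused embedding are illustrative assumptions, and the residual shortcuts and exact dilation schedule of the cited work are simplified away; only the eight attention heads with head dimension 64, the 64×1000 ImR input, and the scalar regression head follow the text.

```python
import torch
import torch.nn as nn

class MultimodalRUL(nn.Module):
    """Sketch: parallel ImR/TFR branches, LSTM fusion, self-attention, regression head."""

    def __init__(self, d_model=512, n_heads=8, tfr_features=7):
        super().__init__()
        # ImR branch: dilated 2D convolutions over rasterized vibration images (1 x 64 x 1000)
        self.imr = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=2, dilation=2), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 8)), nn.Flatten(),
        )
        # TFR branch: dilated 1D convolutions over CWT summary-feature sequences
        self.tfr = nn.Sequential(
            nn.Conv1d(tfr_features, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(32, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8), nn.Flatten(),
        )
        fused_dim = 32 * 4 * 8 + 64 * 8                       # concatenated branch outputs
        self.proj = nn.Linear(fused_dim, d_model)
        self.lstm = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # head dim 512/8 = 64
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, imr_seq, tfr_seq):
        # imr_seq: (B, S, 1, 64, 1000); tfr_seq: (B, S, tfr_features, T) -- S windows per sample
        B, S = imr_seq.shape[:2]
        f_im = self.imr(imr_seq.flatten(0, 1)).view(B, S, -1)   # per-window image features
        f_tf = self.tfr(tfr_seq.flatten(0, 1)).view(B, S, -1)   # per-window TFR features
        z = self.proj(torch.cat([f_im, f_tf], dim=-1))          # fused multimodal representation
        h, _ = self.lstm(z)                                     # temporal degradation modelling
        a, _ = self.attn(h, h, h)                               # multi-head self-attention
        return self.head(a[:, -1]).squeeze(-1)                  # scalar RUL per sample
```

In such a sketch, the weight decay quoted above (e.g., λ = 0.01) would typically be applied through the optimizer, for example torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01).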

3. Training Objectives, Losses, and Optimization

Multimodal-RUL frameworks are trained with carefully designed supervision and regularization schemes:

  • Main Regression Loss: Mean Squared Error (MSE) over predicted and true RUL values:

\mathcal{L}_{MSE}(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2

  • Physics-Based Regularization (if adopted): For frameworks incorporating physics-informed machine learning (PIML), an SDE-based module estimates drift and diffusion parameters for each sensor. These are injected as auxiliary features and regularized to ensure that internal recurrent states are consistent with learned SDE properties:

\mathcal{L}_{phy} = \lambda_\mu \sum_{i,k} \|F_{i,\mu}(h_{k-1}, x_k) - \hat{\mu}_i(k)\|^2 + \lambda_\sigma \sum_{i,k} \|F_{i,\sigma}(h_{k-1}, x_k) - \hat{\sigma}_i(k)\|^2

(Nagaraj et al., 2024)

  • Contrastive and Reconstruction Losses (Robult paradigm):

Soft PU contrastive loss aligns redundancy features across modalities. A latent reconstruction loss ensures retention of unique, modality-specific information:

\mathcal{L}_{PU} + \mathcal{L}_{rec} + \alpha \mathcal{L}_{sup}

where α is annealed during training (Nguyen et al., 3 Sep 2025).

  • Optimizer: Adam is used with dynamic learning rate scheduling and early stopping based on validation loss (Razzaq et al., 7 Dec 2025).

This multi-objective loss design balances accurate RUL regression, multi-modal alignment/fusion, physical consistency, and robustness to incomplete or noisy modalities.
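
A minimal sketch of how the regression and physics-consistency terms can be combined into a single training objective is shown below. The weights lam_mu and lam_sigma and the placeholder tensors for the SDE drift/diffusion estimates are illustrative assumptions, and the soft PU contrastive and reconstruction terms of the Robult paradigm are omitted here.

```python
import torch.nn.functional as F

def composite_loss(rul_pred, rul_true,
                   drift_pred=None, drift_sde=None,
                   diff_pred=None, diff_sde=None,
                   lam_mu=0.1, lam_sigma=0.1):
    """MSE regression loss plus optional physics-informed consistency penalties."""
    loss = F.mse_loss(rul_pred, rul_true)                       # main RUL regression term
    if drift_pred is not None:                                  # drift (mu) consistency
        loss = loss + lam_mu * F.mse_loss(drift_pred, drift_sde)
    if diff_pred is not None:                                   # diffusion (sigma) consistency
        loss = loss + lam_sigma * F.mse_loss(diff_pred, diff_sde)
    return loss
```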

4. Robustness, Data Efficiency, and Handling Missing Modalities

Contemporary multimodal-RUL frameworks incorporate mechanisms to address typical challenges in industrial PHM:

  • Data Efficiency: Through robust modality fusion and feature redundancy management, frameworks can achieve equivalent or better RUL accuracy with significantly reduced training data (e.g., 28% less on XJTU-SY, 48% less on PRONOSTIA) (Razzaq et al., 7 Dec 2025).
  • Noise Robustness: Explicit experiments with injected uniform, Gaussian, or salt-and-pepper noise demonstrate that RUL predictive accuracy degrades only marginally.
  • Missing Modalities: Late-fusion inference and dropout-augmented training enable consistent predictions even when individual sensor modalities become unavailable. Each modality-specific encoder can function independently, and final RUL estimates are aggregated from available branches (Nguyen et al., 3 Sep 2025).
  • Physics/Redundancy Extension: When different sensors reflect distinct degradation physics (e.g., power law vs. exponential), separate SDE modules or information-theoretic estimators can be used per regime (Nagaraj et al., 2024).

This multi-faceted approach ensures multimodal-RUL systems remain robust and generalizable in the presence of real-world data imperfections.
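
Relating to the missing-modality handling described above, the following sketch shows one way a late-fusion aggregation over available branches might look; the encoders, heads, and inputs dictionaries are hypothetical names introduced only for illustration.

```python
import torch

def late_fusion_rul(encoders, heads, inputs):
    """Average RUL estimates over whichever modality branches have data available.
    encoders / heads: dicts mapping modality name -> nn.Module; inputs may hold None."""
    preds = []
    for name, x in inputs.items():
        if x is None:
            continue                                    # skip unavailable sensor modalities
        preds.append(heads[name](encoders[name](x)))    # independent per-modality RUL estimate
    return torch.stack(preds).mean(dim=0)               # aggregate over available branches
```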

5. Explainability via Multimodal Layer-wise Relevance Propagation

Interpretability is enabled using multimodal Layer-wise Relevance Propagation (multimodal-LRP):

  • LRP Algorithm: All layer activations are tracked during the forward pass. During the backward pass, the final logit relevance is propagated through each layer using LRP-0 (pooling, addition, concatenation), LRP-ε (dense layers), and LRP-γ (convolutions) rules.
  • Branch Attribution: At Add layers (residual paths), relevance is split proportionally between main and residual paths. At Concatenate operations (fusion), relevance is evenly divided among modalities.
  • Visualization: Pixel-level (ImR) or feature-level (TFR) relevance maps indicate which input components are most influential for each RUL prediction. For vibration images, LRP heatmaps highlight peaks or sharp changes; for TFR features, statistical attributes (e.g., standard deviation, energy) receive dominant attribution (Razzaq et al., 7 Dec 2025).
  • Trust and Fault Diagnosis: Relevance propagation aligns with known failure indicators, increasing model transparency and utility for predictive maintenance operations.

Layer-wise multimodal explanations provide critical validation for end-users, facilitating diagnostic feedback and regulatory compliance in high-stakes industrial settings.
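
As a concrete illustration, below is a minimal NumPy sketch of two of the propagation steps mentioned above: the LRP-ε rule for a dense layer and the routing of relevance back through a Concatenate fusion (here simply by slicing the fused relevance vector back into its modality segments); the exact relevance weighting used by the cited framework may differ.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    """LRP-epsilon for one dense layer: redistribute output relevance R_out onto inputs a.
    a: (in_dim,), W: (in_dim, out_dim), b: (out_dim,), R_out: (out_dim,)."""
    z = a @ W + b
    z = z + eps * np.sign(z)            # epsilon stabiliser avoids division by near-zero values
    s = R_out / z                       # relevance per unit of pre-activation
    return a * (W @ s)                  # R_in[j] = a[j] * sum_k W[j, k] * s[k]

def lrp_split_concat(R_fused, sizes):
    """Route relevance at a Concatenate layer back to the modality slices that were fused."""
    splits, start = [], 0
    for n in sizes:
        splits.append(R_fused[start:start + n])
        start += n
    return splits
```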

6. Comparative Performance and Application Scope

Empirical evaluation on standard PHM datasets confirms the efficacy of modern multimodal-RUL frameworks:

| Dataset   | Training Data Reduction | Robustness | Interpretability |
|-----------|-------------------------|------------|------------------|
| XJTU-SY   | 28% less                | High       | Multimodal-LRP   |
| PRONOSTIA | 48% less                | High       | Multimodal-LRP   |

  • Performance: Multimodal branches and attention fusion routinely match or surpass single-modality and state-of-the-art baselines under both seen and unseen operating conditions (Razzaq et al., 7 Dec 2025).
  • Evaluation Metrics: RMSE, NASA’s asymmetric RUL Score, MAE, prediction bias, and the proportion of predictions within specified ±Δt windows are standard; a short sketch of two of these follows this list.
  • Generalizability: The core paradigm seamlessly extends to heterogeneous sensor unions, intermittent sensor dropout, partial-physics supervision, and broader industrial asset prognostics (Nagaraj et al., 2024, Nguyen et al., 3 Sep 2025).
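
For reference, a short sketch of RMSE and the asymmetric NASA score. The score below uses the time constants 13 (early predictions) and 10 (late predictions) that are conventional for turbofan benchmarks; applying those defaults to other datasets is an assumption of this sketch.

```python
import numpy as np

def rmse(rul_pred, rul_true):
    """Root-mean-square error between predicted and true RUL."""
    d = np.asarray(rul_pred) - np.asarray(rul_true)
    return float(np.sqrt(np.mean(d ** 2)))

def nasa_score(rul_pred, rul_true, a_early=13.0, a_late=10.0):
    """Asymmetric scoring: late predictions (d > 0) are penalised more heavily than early ones."""
    d = np.asarray(rul_pred) - np.asarray(rul_true)     # positive = prediction too optimistic (late)
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / a_early) - 1.0,
                                 np.exp(d / a_late) - 1.0)))
```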

7. Extensions and Future Directions

Multimodal-RUL research continues to evolve:

  • Integration of Physics-Informed Learning: Hybrid SDE/LSTM architectures allow explicit physics constraints and synthetic data augmentation, especially where the governing physical processes are only partially observed or dynamically estimated (Nagaraj et al., 2024).
  • Semi-supervised and Self-supervised Learning: Soft PU contrastive objectives and latent redundancy/uniqueness decomposition mitigate limited label availability.
  • Interpretable Generalization: Multimodal-LRP and information-theoretic tracking of feature provenance support transparent, domain-aware RUL estimation in high-consequence settings.
  • Plug-and-Play Modality Expansion: Modular design permits dynamic inclusion of new sensor types—or even modalities like images, audio, or multi-physics simulation traces—without retraining the entire system from scratch (Nguyen et al., 3 Sep 2025).
  • Application Scope: Typical targets include rotating machinery (e.g., rolling-element bearings, turbofans), with the frameworks applicable to any degradation process with observable multichannel signals.

A plausible implication is that future multimodal-RUL systems will further unify physics-data hybrids, fully modular sensor fusion, and certifiable explainability to enable cross-industry, scalable, and trustworthy prognostics.
