EmbraceNet: Robust Multimodal Fusion
- EmbraceNet is a multimodal fusion architecture that performs probabilistic, per-dimension modality selection to adaptively integrate data from multiple heterogeneous modalities.
- It leverages modality-specific docking layers and a stochastic embracement layer to maintain robust performance, even with missing inputs.
- Empirical evaluations in sensor networks, activity recognition, image-text classification, emotion recognition, and mental health screening demonstrate competitive accuracy and strong resilience to missing or corrupted inputs.
EmbraceNet is a deep learning architecture for multimodal data fusion, specifically designed to maintain robustness and high performance even when some input modalities are partially or entirely missing. It introduces a probabilistic approach to modality selection within each feature dimension of its fusion layer, enabling adaptive integration of heterogeneous data sources. This methodology has been evaluated across various domains, including sensor networks, activity recognition, fine-grained image-text classification, emotion recognition, and mental health screening.
1. Architectural Principles
EmbraceNet is structured around two primary components: docking layers and an embracement (fusion) layer. Each raw modality representation, which may be a CNN, RNN, or dense-layer output, or hand-crafted features, passes through its own docking layer. The docking layer linearly projects the modality's output onto a common $c$-dimensional space:

$$d^{(k)} = a\!\left(W^{(k)} x^{(k)} + b^{(k)}\right),$$

where $k$ indexes modalities, $x^{(k)}$ is the modality-specific feature vector, $W^{(k)}$ and $b^{(k)}$ are learnable weights and biases, and $a(\cdot)$ is a nonlinear activation (ReLU, sigmoid, tanh).
The fusion mechanism, named the embracement layer, performs stochastic modality selection across feature dimensions using multinomial sampling. For each feature index $i$ of the fused vector $e \in \mathbb{R}^c$, a one-hot selection vector is drawn:

$$r_i = \left[r_i^{(1)}, \ldots, r_i^{(m)}\right] \sim \operatorname{Multinomial}(1, \mathbf{p}),$$

where $\mathbf{p} = [p_1, \ldots, p_m]$ with $\sum_{k=1}^{m} p_k = 1$ is the modality probability vector. The fused feature is then computed as

$$e_i = \sum_{k=1}^{m} r_i^{(k)}\, d_i^{(k)}.$$

This guarantees that each component of $e$ is contributed by exactly one randomly chosen modality, while the stochastic selection process trains the network to learn cross-modal correlations and produce robust features.
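For concreteness, the following is a minimal PyTorch sketch of the docking-plus-embracement scheme described above. It is an illustrative reimplementation under stated assumptions, not the authors' reference code; the class name, the `availability` mask, and the equal default probabilities are choices made here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbraceNet(nn.Module):
    """Sketch of EmbraceNet-style fusion: per-modality docking layers
    followed by a stochastic embracement (fusion) layer."""

    def __init__(self, input_dims, embrace_dim):
        super().__init__()
        # One docking layer per modality: linear projection onto the
        # shared c-dimensional space, followed by a nonlinearity.
        self.docking = nn.ModuleList(
            nn.Linear(d_in, embrace_dim) for d_in in input_dims
        )
        self.embrace_dim = embrace_dim

    def forward(self, inputs, availability=None, probs=None):
        # inputs: list of (batch, d_in_k) tensors, one per modality.
        batch, m = inputs[0].shape[0], len(inputs)

        # Docking: d^(k) = a(W^(k) x^(k) + b^(k))  ->  (batch, m, c)
        docked = torch.stack(
            [F.relu(dock(x)) for dock, x in zip(self.docking, inputs)], dim=1
        )

        # Default: equal selection probability for every modality.
        if probs is None:
            probs = torch.full((batch, m), 1.0 / m, device=docked.device)
        if availability is not None:
            # Zero out missing modalities and renormalize
            # (assumes at least one modality is available per sample).
            probs = probs * availability
            probs = probs / probs.sum(dim=1, keepdim=True)

        # Embracement: for each of the c feature indices, sample one
        # modality index, i.e. Multinomial(1, p) per dimension.
        idx = torch.multinomial(probs, self.embrace_dim, replacement=True)

        # e_i = d_i^(k) for the sampled modality k at feature index i.
        embraced = torch.gather(docked, 1, idx.unsqueeze(1)).squeeze(1)
        return embraced  # (batch, c)
```

Note that `torch.multinomial` draws one modality index per feature dimension, which is equivalent to sampling the one-hot vectors $r_i$ above and summing over modalities.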
2. Multimodal Fusion and Robustness
EmbraceNet's two-stage fusion approach begins with independent modality processing in the docking layers, followed by the embracement layer's probabilistic fusion. During training, the stochastic selection exposes the network to partial modality activations, effectively functioning as a modality-wise analogue of dropout.
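As a one-line sanity check (a standard expectation computation, not taken from the source papers), the expected fused feature is the probability-weighted mixture of the docked features, mirroring dropout's expected-value interpretation:

$$\mathbb{E}[e_i] = \sum_{k=1}^{m} \mathbb{E}\!\left[r_i^{(k)}\right] d_i^{(k)} = \sum_{k=1}^{m} p_k\, d_i^{(k)}.$$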
Crucially, when a modality is missing, its presence indicator is set to zero and the fusion probabilities are renormalized:

$$\hat{p}_k = \frac{b_k\, p_k}{\sum_{j=1}^{m} b_j\, p_j},$$

where $b_k \in \{0, 1\}$ indicates whether modality $k$ is available. This ensures absent modalities do not contribute to the fused vector, enabling seamless handling of missing data in both training and inference.
Randomizing the selection probabilities during training further regularizes the system and enhances resilience to sensor failure or data corruption. This modality-aware stochastic dropout forces the docking layers to learn conditionally robust representations, mitigating overfitting to any single modality.
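Continuing the sketch above, handling a missing modality at inference reduces to passing an availability mask; the dimensions and names below are illustrative:

```python
# Two modalities: a 128-d image feature and a 64-d text feature,
# fused into a 256-d embracement vector (dimensions are illustrative).
net = EmbraceNet(input_dims=[128, 64], embrace_dim=256)
img, txt = torch.randn(4, 128), torch.randn(4, 64)

# Both modalities present: equal default probabilities p = [0.5, 0.5].
fused = net([img, txt])

# Text missing for the whole batch: its probability renormalizes to
# zero (p-hat = [1.0, 0.0]), so no fused feature comes from the text dock.
avail = torch.tensor([1.0, 0.0]).expand(4, 2)
fused_partial = net([img, torch.zeros(4, 64)], availability=avail)
print(fused.shape, fused_partial.shape)  # torch.Size([4, 256]) for both
```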
3. Domain-Specific Applications
EmbraceNet has demonstrated efficacy across a range of multimodal tasks:
Sensor Arrays and Human Activity Recognition: In the original work (Choi et al., 2019), EmbraceNet was validated on the Gas Sensor Arrays and OPPORTUNITY datasets. When deployed for chemical source identification and activity classification, EmbraceNet consistently showed smaller drops in F₁ score under missing or degraded conditions compared to classical fusion techniques (early/late/intermediate concatenation, autoencoder, multi-linear pooling).
Mobile Sensor Fusion for Activity Recognition: In activity recognition for the SHL challenge (Choi et al., 2020), EmbraceNet was configured with independent convolutional preprocessing for each of seven modalities, followed by the probabilistic fusion. Domain-specific FFT transformation, random rotation augmentation, and self-ensembling at output further enhanced accuracy. The EmbraceNet-based system achieved 65.22% accuracy, outperforming early fusion (46.73%) and reaching up to 87.1% accuracy for certain sensor positions after scaling up model capacity.
Fine-Grained Fashion Classification: GLAMI-1M benchmarking (Kosar et al., 2022) evaluated EmbraceNet on top of image and text encoders (ResNeXt-50 and mT5-small), with fusion probabilities statically set to ½ per modality. The best EmbraceNet result combined both modalities, yielding 69.7% top-1 accuracy and exceeding both single-modality and standard fine-tuned backbone baselines.
Emotion Recognition: In multimodal emotion recognition (Wang et al., 2023), a modified EmbraceNet fused body (ResNet-101) and pose (STGCN from OpenPose keypoints) features for robust classification in partially occluded or noisy scenarios. Fusion was further complemented by scene (ResNet-18), semantic (transformer-based patch embedding), and depth features. This architecture produced an average precision of 40.39% on EMOTIC’s 26 categories, a 5% improvement over prior state of the art.
Mental Disorder Screening: The GAME model (Du et al., 2023) used EmbraceNet to fuse eight modalities, spanning facial features, physiological signals, voice embeddings (wav2vec 2.0, MFCCs), and textual cues (RoBERTa, PERT), combined with a novel attention mechanism based on DTW-derived relation graphs. EmbraceNet's fusion architecture allowed GAME to reach up to 92.77% accuracy and 91.06% weighted F₁ in adolescent mental disorder screening via a low-cost robot platform.
4. Mathematical Formulations
EmbraceNet’s core operations encompass projection and probabilistic fusion. The principal equations are:
Projection (Docking Layer):

$$d^{(k)} = a\!\left(W^{(k)} x^{(k)} + b^{(k)}\right)$$

Fusion (Embracement Layer):

$$r_i \sim \operatorname{Multinomial}(1, \mathbf{p}), \qquad e_i = \sum_{k=1}^{m} r_i^{(k)}\, d_i^{(k)}$$

Probability Update for Missing Modalities:

$$\hat{p}_k = \frac{b_k\, p_k}{\sum_{j=1}^{m} b_j\, p_j},$$

where $b_k$ is the modality presence indicator.
Attention Mechanism (GAME):

Dynamic Time Warping (DTW) computes inter-feature distances $\delta_{ij}$, and attention weights are assigned via a softmax over the (negated) distances:

$$w_{ij} = \frac{\exp(-\delta_{ij})}{\sum_{j'} \exp(-\delta_{ij'})},$$

after which the weighted features are aggregated along the optimally aligned sequences.
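A compact sketch of this weighting idea follows. It is not GAME's exact formulation (the negated-distance softmax and all helper names are assumptions here); it only illustrates turning pairwise DTW distances into attention weights:

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]


def dtw_softmax_weights(streams):
    """Turn pairwise DTW distances into row-wise softmax attention
    weights: more similar (closer) streams receive larger weight."""
    k = len(streams)
    dist = np.array([[dtw_distance(a, b) for b in streams] for a in streams])
    logits = -dist                                # closer -> higher weight
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


# Three toy modality streams of different lengths.
streams = [np.sin(np.linspace(0, 3, 40)),
           np.sin(np.linspace(0, 3, 55)) + 0.1,
           np.random.default_rng(0).standard_normal(50)]
print(dtw_softmax_weights(streams).round(3))      # each row sums to 1
```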
5. Empirical Performance and Comparative Analysis
Empirical results across referenced studies consistently show EmbraceNet’s resilience and competitive accuracy:
| Task/Domain | Dataset | Modalities Used | Accuracy/F₁ | Improvement Context |
|---|---|---|---|---|
| Sensor fusion, classification | Gas Sensor Arrays, OPPORTUNITY | 8 (sensors), 19 (channels) | Higher F₁, smaller gap | Loss of sensors |
| Activity recognition | SHL challenge | 7 (mobile sensor locations) | 65.22–87.10% | Outperforms early/late fusion |
| Fashion image-text classification | GLAMI-1M | Image+Text | 69.7% Top-1 | Surpasses single-modal, ResNeXt-50 |
| Emotion recognition | EMOTIC, others | Body+Pose+Scene+Semantic | 40.39% AP | 5% better than prior SOTA |
| Mental disorder screening | GAME | 8 (vision/audio/text) | Up to 92.77% accuracy | High F₁, multi-modality, explainability |
Modalities contribute dynamically: ablation analyses consistently show marked performance drops when informative components (e.g., wav2vec or RoBERTa embeddings, scene features, or the attention-based fusion itself) are removed, underscoring the importance of resilient probabilistic fusion.
6. Applications and Practical Implications
EmbraceNet’s modality-agnostic fusion and inherent handling of missing data make it suitable for:
- Sensor networks with intermittent connectivity (IoT)
- Medical informatics with incomplete records
- Activity and gesture recognition in wearable/mobile devices with unreliable sensors
- Context-aware emotion recognition where occlusion or frame loss is common
- Large-scale, explainable mental health screening tools
The architecture’s flexibility supports easy integration with modality-specific deep or shallow features and aids interpretability through dynamic fusion weights and ablation-based contribution assessment.
7. Limitations and Future Directions
While EmbraceNet excels in robustness, certain constraints remain. Fixed fusion probabilities can be suboptimal, particularly in long-tailed, multi-class settings (Kosar et al., 2022). Label noise in large datasets may also subtly impact the architecture’s performance. Additional modalities such as semantic and depth information further improve accuracy, suggesting ongoing potential for architectural extensions. Tuning of fusion probabilities and deeper investigation into cross-modal regularization represent plausible directions for refinement.
In summary, EmbraceNet provides a mathematically principled, modality-agnostic, and robust multimodal fusion strategy. Its stochastic modality selection mechanism both fuses complementary representations and regularizes for missing data, yielding high accuracy and resilience across diverse real-world applications.