CentralNet: Adaptive Multimodal Fusion
- CentralNet is a deep multimodal fusion architecture that adaptively integrates modality-specific networks using trainable fusion weights.
- It bridges early and late fusion strategies via a central module, yielding significant accuracy improvements on diverse benchmarks.
- The architecture leverages multi-task regularization and dynamic fusion depth selection to ensure robust performance across tasks.
CentralNet is a deep multimodal fusion architecture that introduces a trainable, multilayer central network responsible for adaptive integration of hidden representations from multiple modality-specific subnetworks. Its design provides a continuous interpolation between early and late fusion strategies, driven by learnable fusion weights at each network depth and regularized via multi-task learning objectives. CentralNet has been validated across diverse multimodal benchmarks, consistently yielding improved accuracy relative to state-of-the-art baselines by automatically selecting optimal fusion depths for each task (Vielzeuf et al., 2018, Vielzeuf et al., 2018).
1. Motivation and Overview
The challenge in multimodal deep learning lies in how and when to fuse independent modality streams—early (at feature level), late (at decision/score level), or mid-level. No universal best practice exists, as optimal fusion depth is highly task-dependent. Traditional approaches tend toward architectural locality, either concatenating features early on or aggregating predicted scores late. CentralNet proposes a central fusion module that interfaces with each unimodal network at every layer, combining their representations via learnable scalar weights. This realizes a multi-level fusion framework in which the network dynamically learns "how much" and "where" to fuse across modalities (Vielzeuf et al., 2018, Vielzeuf et al., 2018).
2. Architecture and Mathematical Formulation
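Consider $K$ modality-specific networks $M_k$ ($k = 1, \dots, K$), each generating hidden representations $h_{M_k}^{(l)}$ at depth $l$ ($l = 0, \dots, L$), with $h_{M_k}^{(0)}$ as the input feature and $h_{M_k}^{(L)}$ as the logit vector. The central network is constructed with the same depth $L$, maintaining its hidden state $h_C^{(l)}$, initialized with $h_C^{(0)} = 0$ or by a weighted sum of the $h_{M_k}^{(0)}$.
At every layer $l$, the central network fuses all $h_{M_k}^{(l)}$ and its own previous state $h_C^{(l)}$ by a learned weighted sum as follows:
$$h_C^{(l+1)} = \sigma\!\left( W_C^{(l)} \left( \alpha_C^{(l)} h_C^{(l)} + \sum_{k=1}^{K} \alpha_{M_k}^{(l)} h_{M_k}^{(l)} \right) + b_C^{(l)} \right)$$
where $\alpha_{M_k}^{(l)}$ and $\alpha_C^{(l)}$ are trainable, scalar fusion weights associated respectively with the $k$-th modality and the central net self-connection, $b_C^{(l)}$ is the bias, $W_C^{(l)}$ is the central layer weight matrix, and $\sigma$ is a nonlinearity (usually ReLU).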
At the output, the logits of all unimodal branches and the central branch can be fused by another weighted sum before the final softmax or sigmoid, thus realizing generalized late fusion as a special case.
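The layer-wise fusion and the output-level fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `central_layer` and `fuse_logits`, the toy dimensions, the zero initial central state, and the uniform weights are all assumptions made for the example, and every hidden state is assumed to have the same dimension at a given depth so that the weighted sum is defined.

```python
import numpy as np

def central_layer(h_central, h_modalities, alpha_central, alpha_modalities, W, b):
    """One central layer: scalar-weighted sum of inputs, affine map, then ReLU."""
    fused = alpha_central * h_central
    for alpha_k, h_k in zip(alpha_modalities, h_modalities):
        fused = fused + alpha_k * h_k
    return np.maximum(W @ fused + b, 0.0)

def fuse_logits(logits, weights):
    """Generalized late fusion: weighted sum of branch logits, then softmax."""
    z = sum(w * l for w, l in zip(weights, logits))
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
dim, n_classes = 4, 3                    # toy sizes (assumption)
h_c = np.zeros(dim)                      # zero initial central state (assumption)
h_m = [rng.normal(size=dim) for _ in range(2)]   # two modality hidden states
W = rng.normal(size=(dim, dim))
b = np.zeros(dim)

h_next = central_layer(h_c, h_m, 1.0, [0.5, 0.5], W, b)
# Two unimodal branches plus the central branch, fused with uniform weights.
branch_logits = [rng.normal(size=n_classes) for _ in range(3)]
probs = fuse_logits(branch_logits, [1 / 3, 1 / 3, 1 / 3])
```

In a trainable version the scalars passed as `alpha_central` and `alpha_modalities` would be learned parameters, one set per depth, alongside `W` and `b`.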
3. Multi-Task Learning Objective and Regularization
CentralNet employs a multi-objective loss to ensure the central fusion improves over, but does not degrade, unimodal branch performance. Let $\hat{y}_C$ denote the softmax output of the central net, and $\hat{y}_{M_k}$ the unimodal outputs. With ground-truth $y$, central and unimodal cross-entropy losses are given as
$$\mathcal{L}_C = -\sum_j y_j \log \hat{y}_{C,j}, \qquad \mathcal{L}_{M_k} = -\sum_j y_j \log \hat{y}_{M_k,j}$$
The overall loss takes the form:
$$\mathcal{L} = \mathcal{L}_C + \sum_{k=1}^{K} \beta_k \mathcal{L}_{M_k}$$
In all reported experiments, $\beta_k = 1$, so the fusion and unimodal objectives are optimized jointly, regularizing the entire architecture and stabilizing training. For multilabel setups (MM-IMDb), a weighted binary cross-entropy with a positive-class weight is applied for the central and unimodal outputs (Vielzeuf et al., 2018, Vielzeuf et al., 2018).
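As a worked example of the multi-task objective, the sketch below adds the central cross-entropy to a weighted sum of the unimodal cross-entropies for a single sample. The function names, the toy probability vectors, and the default of unit weights are assumptions of this illustration, not values taken from the paper.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy of a softmax output against a one-hot target."""
    return -math.log(probs[target_index])

def centralnet_loss(central_probs, unimodal_probs, target_index, betas=None):
    """Central loss plus the weighted sum of the unimodal losses."""
    if betas is None:
        betas = [1.0] * len(unimodal_probs)   # unit weights (assumed default)
    loss = cross_entropy(central_probs, target_index)
    for beta, probs in zip(betas, unimodal_probs):
        loss += beta * cross_entropy(probs, target_index)
    return loss

# Toy softmax outputs for a 3-class problem whose true class is index 0.
central = [0.7, 0.2, 0.1]
unimodal = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]
total = centralnet_loss(central, unimodal, target_index=0)
```

With unit weights the total equals $-\log(0.7 \cdot 0.6 \cdot 0.5)$, so a weak unimodal branch raises the loss even when the central prediction is good, which is the regularizing effect the text describes.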
4. Implementation and Training Protocol
CentralNet is designed to wrap arbitrary modality-specific deep networks. In each branch, best-practice modules (e.g., LeNet-5 CNN for image, MLPs for audio/text) are used. Dense and convolutional layers are regularized via batch normalization and dropout (typically on dense layers).
The recommended training protocol is:
- (Optional) Pre-train each unimodal subnetwork independently using its own data and its own unimodal loss.
- Initialize the central net’s parameters and all fusion weights.
- Jointly fine-tune the entire architecture (central and all modality-specific branches) using the Adam optimizer with its default settings, optimizing the overall multi-task loss on all samples.
- Use dataset-specific learning rates and schedule, apply early stopping where appropriate; apply batch-wise dropout and batch normalization to combat overfitting.
Typical hyperparameters include a batch size of $128$ (or $42$ for Montalbano), dataset-specific learning rates, and 100 epochs (unless specified) (Vielzeuf et al., 2018, Vielzeuf et al., 2018).
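The settings quoted above can be collected in a small configuration object. The sketch below is hypothetical: `TrainConfig` and `config_for` are names invented for this example, and the learning rate is deliberately a required argument with no default, because the text only says it is dataset-specific.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float              # dataset-specific; no default assumed
    batch_size: int = 128
    epochs: int = 100
    optimizer: str = "adam"
    pretrain_unimodal: bool = False   # the optional first step of the protocol

def config_for(dataset: str, learning_rate: float) -> TrainConfig:
    """Return the quoted settings for a dataset name (hypothetical helper)."""
    batch_size = 42 if dataset.lower() == "montalbano" else 128
    return TrainConfig(learning_rate=learning_rate, batch_size=batch_size)

# The learning rate here is a placeholder, not a value from the paper.
cfg = config_for("Montalbano", learning_rate=1e-3)
```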
5. Empirical Evaluation Across Multimodal Tasks
CentralNet was validated on four diverse benchmarks:
- Multimodal MNIST (two artificial views): Each branch is LeNet-5; CentralNet achieves a 5% relative error reduction over best baselines (53% errors vs. 56–68% for late/early fusions, ModDrop, GMU).
- Audiovisual MNIST (corrupted images + audio spectrograms): CentralNet yields 95.0% accuracy, outperforming the 94.8% baseline.
- ChaLearn Montalbano (4-modal gesture recognition): Macro-accuracy for CentralNet is 98.27%, a statistically significant improvement over ModDrop (98.19%), GMU (97.98%), and other fusions.
- MM-IMDb (poster+plot genre multilabel): CentralNet attains a micro-F1 of 0.639, surpassing GMU (0.630) and all other listed multimodal baselines.
Results are consistent across all four benchmarks, with CentralNet outperforming the best single-modality, early-fusion, late-fusion, ModDrop, and GMU baselines. Reported figures are averaged over 64 runs, which supports the statistical significance of the differences.
| Dataset | CentralNet | Best Baseline Type | Best Baseline Value |
|---|---|---|---|
| Multimodal MNIST | 53% err | Early/Late/ModDrop/GMU | 56–68% err |
| Audiovisual MNIST | 95.0% | Early/Late/ModDrop/GMU | 94.8% |
| Montalbano | 98.27% | ModDrop | 98.19% |
| MM-IMDb | 0.639 F1 | GMU | 0.630 |
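The size of these gains is easier to compare as a relative error reduction. The short calculation below uses only the figures quoted in the table; the helper name is an assumption, and accuracies are converted to error rates by subtracting from 100.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new method."""
    return (baseline_error - new_error) / baseline_error

# Multimodal MNIST: errors as quoted (53 vs. the best baseline at 56).
mnist = relative_error_reduction(56, 53)

# Accuracy-based benchmarks: convert accuracy (%) to error (%) first.
av_mnist = relative_error_reduction(100 - 94.8, 100 - 95.0)
montalbano = relative_error_reduction(100 - 98.19, 100 - 98.27)

# MM-IMDb is reported as F1, so only the absolute gain is computed.
mm_imdb_gain = 0.639 - 0.630
```

The Multimodal MNIST value works out to 3/56, a little over 5%, which matches the relative reduction stated in the list above. The ratio does not depend on the unit of the error figures, as long as both use the same one.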
6. Analysis of Fusion Weights and Fusion Schedules
Inspection of learned fusion weights $\alpha_{M_k}^{(l)}$ (modalities) and $\alpha_C^{(l)}$ (central) enables interpretation of preferred fusion strategies per task:
- Early fusion emerges when $\alpha_{M_k}^{(l)}$ is strong at shallow layers, favoring low-level features.
- Late fusion is expressed via dominant $\alpha_{M_k}^{(l)}$ at deeper layers.
- Hybrid schedules, with weights spread across depth, are observed on tasks like Montalbano.
- For MM-IMDb, early fusion dominates (low-level fusion of posters/text), with deeper layers shifting slightly toward textual inputs.
- AFEW prefers face-dominated early fusion, but central fusion gains prominence in the last layer.
- Audioset exhibits negligible central influence in early conv layers, increasing toward the classifier output.
This demonstrates that CentralNet adapts the modality-combination schedule contingent on intrinsic task requirements (Vielzeuf et al., 2018).
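One simple way to read such a schedule is to compute, per layer, the share of the total absolute fusion weight that comes from the modality branches rather than from the central self-connection. The sketch below is a hypothetical diagnostic, not a procedure from the paper, and the weight values are made up for illustration.

```python
def modality_share(alpha_central, alpha_modalities):
    """Share of the absolute fusion weight at one layer owed to the modalities."""
    modal = sum(abs(a) for a in alpha_modalities)
    total = modal + abs(alpha_central)
    return modal / total if total > 0 else 0.0

# Made-up learned weights: (central weight, [modality weights]) per layer.
weights = [
    (0.0, [0.9, 0.8]),   # layer 0: only modality inputs reach the central net
    (1.0, [0.3, 0.2]),   # layer 1
    (1.0, [0.1, 0.1]),   # layer 2
]

shares = [modality_share(c, m) for c, m in weights]
# Layer at which the modality branches contribute the largest share.
peak_layer = max(range(len(shares)), key=shares.__getitem__)
```

Here the share falls with depth and peaks at layer 0, the pattern the list above associates with early fusion; a peak near the last layer would instead suggest a late-fusion schedule.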
7. Significance and Comparative Context
CentralNet establishes a unified, flexible mechanism for multi-level fusion in multimodal deep neural networks. Its add-on central module does not mandate specific architectural restrictions on the unimodal branches, allowing plug-in use with recent advances in each modality domain. The empirical results indicate consistent, statistically significant improvements over state-of-the-art multimodal fusion techniques such as ModDrop and Gated Multimodal Units (GMU). The multi-objective regularizer is critical not only for maintaining unimodal performance but also for accelerating convergence during training. A plausible implication is that adaptive, learnable fusion depth can be leveraged broadly in multimodal or multi-sensor systems to automatically discover fusion schedules optimal for diverse domains (Vielzeuf et al., 2018, Vielzeuf et al., 2018).