CentralNet: Adaptive Multimodal Fusion

Updated 13 March 2026

CentralNet is a deep multimodal fusion architecture that adaptively integrates modality-specific networks using trainable fusion weights.
It bridges early and late fusion strategies via a central module, yielding significant accuracy improvements on diverse benchmarks.
The architecture leverages multi-task regularization and dynamic fusion depth selection to ensure robust performance across tasks.

CentralNet is a deep multimodal fusion architecture that introduces a trainable, multilayer central network responsible for adaptive integration of hidden representations from multiple modality-specific subnetworks. Its design provides a continuous interpolation between early and late fusion strategies, driven by learnable fusion weights at each network depth and regularized via multi-task learning objectives. CentralNet has been validated across diverse multimodal benchmarks, consistently yielding improved accuracy relative to state-of-the-art baselines by automatically selecting optimal fusion depths for each task (Vielzeuf et al., 2018, Vielzeuf et al., 2018).

1. Motivation and Overview

The challenge in multimodal deep learning lies in how and when to fuse independent modality streams: early (at feature level), late (at decision/score level), or at some intermediate level. No universal best practice exists, as the optimal fusion depth is highly task-dependent. Traditional approaches fix the fusion point in advance, either concatenating features early on or aggregating predicted scores late. CentralNet instead proposes a central fusion module that interfaces with each unimodal network at every layer, combining their representations via learnable scalar weights. This realizes a multi-level fusion framework in which the network learns "how much" and "where" to fuse across modalities (Vielzeuf et al., 2018, Vielzeuf et al., 2018).

2. Architecture and Mathematical Formulation

Consider $K$ modality-specific networks $M^k$ ($k=1,\ldots,K$), each generating hidden representations $h^k_i\in\mathbb{R}^{d_i}$ at depth $i$ ($i=0,\ldots,L$), with $h^k_0$ as the input feature and $h^k_L$ as the logit vector. The central network $C$ is constructed with the same depth $L$ and maintains its own hidden state $h^C_i$, initialized with $h^C_0 = 0$ or by a weighted sum of the $h^k_0$.

At every layer $i$, the central network fuses all $h^k_i$ and its own previous state $h^C_i$ by a learned weighted sum as follows:

$$h^C_{i+1} = \sigma\Big(W^C_i\Big(\alpha^C_i h^C_i + \sum_{k=1}^{K} \alpha^k_i h^k_i\Big) + b^C_i\Big)$$

where $\alpha^k_i$ and $\alpha^C_i$ are trainable scalar fusion weights associated respectively with the $k$-th modality and the central net self-connection, $b^C_i$ is the bias, $W^C_i$ is the central layer's weight matrix, and $\sigma$ is a nonlinearity (usually ReLU).

At the output, the logits of all unimodal branches $h^k_L$ and of the central branch $h^C_L$ can be fused by another weighted sum before the final softmax or sigmoid, thus realizing generalized late fusion as a special case.
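The layer-wise fusion described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names (`central_layer`, `fuse_logits`), the toy dimensions, the random inputs, and the zero initial central state are all assumptions made for the example. Each central layer takes a weighted sum of its previous state and the modality hidden states (scalar weights), then applies an affine map and ReLU.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def central_layer(h_central, h_modalities, alpha_central, alpha_modalities, W, b):
    """One central layer: scalar-weighted sum of inputs, then affine map and ReLU."""
    z = alpha_central * h_central
    for alpha_k, h_k in zip(alpha_modalities, h_modalities):
        z = z + alpha_k * h_k
    return relu(W @ z + b)

def fuse_logits(central_logits, modality_logits, w_central, w_modalities):
    """Generalized late fusion: weighted sum of the branch logits."""
    out = w_central * central_logits
    for w_k, logits_k in zip(w_modalities, modality_logits):
        out = out + w_k * logits_k
    return out

# Toy setup (hypothetical sizes): K = 2 modalities, layer maps 4 -> 3 units.
rng = np.random.default_rng(0)
d_in, d_out, K = 4, 3, 2
h_mods = [rng.normal(size=d_in) for _ in range(K)]
h_c = np.zeros(d_in)                 # assumed zero initial central state
W = rng.normal(size=(d_out, d_in))   # central layer weight matrix
b = np.zeros(d_out)                  # central layer bias

h_next = central_layer(h_c, h_mods, 1.0, [0.5, 0.5], W, b)
```

Setting the central weight to 0 and a single modality weight to 1 reduces the layer to an ordinary dense layer on that modality, which is one way to see how the scalar weights select what each depth listens to.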

3. Multi-Task Learning Objective and Regularization

CentralNet employs a multi-objective loss so that the central fusion improves on the unimodal branches without degrading their performance. Let $\hat{y}^C$ denote the softmax output of the central net, and $\hat{y}^k$ the unimodal outputs. With ground-truth $y$, the central and unimodal cross-entropy losses are given as

$$\ell_C = -\sum_c y_c \log \hat{y}^C_c, \qquad \ell_k = -\sum_c y_c \log \hat{y}^k_c$$

The overall loss takes the form: $$\mathcal{L} = \ell_C + \sum_{k=1}^{K} \beta_k\, \ell_k$$ In all reported experiments, $\beta_k = 1$, so the fusion and unimodal objectives are optimized jointly, regularizing the entire architecture and stabilizing training. For multilabel setups (MM-IMDb), a weighted binary cross-entropy with a positive-class weight is applied to the central and unimodal outputs (Vielzeuf et al., 2018, Vielzeuf et al., 2018).
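As a rough illustration of the multi-task objective, the sketch below sums the central cross-entropy and the unimodal cross-entropies. The helper names, the `betas` argument (defaulting to unit weights), and the `pos_weight` parameter of the multilabel variant are assumptions of this example rather than details taken from the papers.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels; logits has shape (N, C)."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def multitask_loss(central_logits, modality_logits, labels, betas=None):
    """Central loss plus weighted unimodal losses (unit weights by default)."""
    if betas is None:
        betas = [1.0] * len(modality_logits)
    loss_c = cross_entropy(central_logits, labels)
    losses_k = [cross_entropy(l, labels) for l in modality_logits]
    total = loss_c + sum(b * l for b, l in zip(betas, losses_k))
    return total, loss_c, losses_k

def weighted_bce(logits, targets, pos_weight):
    """Multilabel binary cross-entropy; positive terms are scaled by pos_weight."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    per_label = -(pos_weight * targets * np.log(p + eps)
                  + (1.0 - targets) * np.log(1.0 - p + eps))
    return float(per_label.mean())

# Uniform predictions over 4 classes from the central net and two modalities:
labels = np.array([0, 2])
central = np.zeros((2, 4))
mods = [np.zeros((2, 4)), np.zeros((2, 4))]
total, loss_c, losses_k = multitask_loss(central, mods, labels)
# each branch contributes log(4), so total is approximately 3 * log(4)
```

Because every branch receives its own supervised term, a unimodal network cannot drift into a representation that only helps the central net, which is the regularizing effect the text describes.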

4. Implementation and Training Protocol

CentralNet is designed to wrap arbitrary modality-specific deep networks. Each branch uses an architecture suited to its modality (e.g., a LeNet-5 CNN for images, MLPs for audio/text). Dense and convolutional layers are regularized via batch normalization and dropout (the latter typically on dense layers).

The recommended training protocol is:

(Optional) Pre-train each unimodal subnetwork independently using its own data and its unimodal cross-entropy loss $\ell_k$.
Initialize the central net’s parameters and all fusion weights.
Jointly fine-tune the entire architecture (central and all modality-specific branches) with the Adam optimizer (default hyperparameters), optimizing the overall multi-task loss $\mathcal{L}$ on all samples.
Use dataset-specific learning rates and schedules, apply early stopping where appropriate, and use dropout and batch normalization to combat overfitting.
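The early-stopping step of the protocol can be sketched independently of any deep learning framework. The class below is a generic patience-based rule, not something specified in the source: `patience`, `min_delta`, and the list of validation losses are hypothetical, and `run` simply replays precomputed losses in place of a real joint fine-tuning loop.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def run(val_losses, max_epochs=100, patience=3):
    """Replay per-epoch validation losses; return (last epoch run, best loss)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch, stopper.best
    return min(len(val_losses), max_epochs), stopper.best

stopped_at, best = run([1.0, 0.8, 0.9, 0.85, 0.81, 0.7], patience=3)
print(stopped_at, best)  # 5 0.8
```

In a real run the validation loss passed to `step` would be the multi-task loss (or the central branch's metric) evaluated after each epoch of joint fine-tuning.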

Batch size and learning rate are chosen per dataset (Montalbano uses a different batch size from the other benchmarks), and training runs for 100 epochs unless otherwise specified (Vielzeuf et al., 2018, Vielzeuf et al., 2018).

5. Empirical Evaluation Across Multimodal Tasks

CentralNet was validated on four diverse benchmarks:

Multimodal MNIST (two artificial views): Each branch is LeNet-5; CentralNet achieves a 5% relative error reduction over best baselines (53% errors vs. 56–68% for late/early fusions, ModDrop, GMU).
Audiovisual MNIST (corrupted images + audio spectrograms): CentralNet yields 95.0% accuracy, outperforming the 94.8% baseline.
ChaLearn Montalbano (4-modal gesture recognition): Macro-accuracy for CentralNet is 98.27%, a statistically significant improvement over ModDrop (98.19%), GMU (97.98%) and other fusions.
MM-IMDb (poster+plot genre multilabel): CentralNet attains a micro-F1 of 0.639, surpassing GMU (0.630) and all other listed multimodal baselines.

Results are consistent across all metrics, with CentralNet outperforming the best single-modality, early-fusion, late-fusion, ModDrop, and GMU baselines. Reported numbers are averaged over 64 runs, which supports the statistical significance of the differences.
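The "5% relative error reduction" quoted for Multimodal MNIST follows from the reported error figures; a small check, using a hypothetical helper `relative_reduction`, is:

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which the error shrinks relative to the baseline."""
    return (baseline_error - new_error) / baseline_error

# Best baseline error (56) versus CentralNet (53) on Multimodal MNIST.
print(f"{relative_reduction(56, 53):.1%}")  # 5.4%
```

The same formula applied to the weakest listed baseline (68) gives a much larger reduction, so the quoted figure is the conservative comparison against the strongest competitor.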

Dataset	Metric	CentralNet	Best baseline(s)	Baseline value
Multimodal MNIST	Error	53%	Early/Late/ModDrop/GMU	56–68%
Audiovisual MNIST	Accuracy	95.0%	Early/Late/ModDrop/GMU	94.8%
Montalbano	Macro-accuracy	98.27%	ModDrop	98.19%
MM-IMDb	Micro-F1	0.639	GMU	0.630

6. Analysis of Fusion Weights and Fusion Schedules

Inspection of the learned fusion weights $\alpha^k_i$ (modalities) and $\alpha^C_i$ (central) enables interpretation of the preferred fusion strategy per task:

Early fusion emerges when $\alpha^k_i$ is strong at shallow layers, favoring low-level features.
Late fusion is expressed via dominant $\alpha^k_i$ at deeper layers.
Hybrid schedules, with weights spread across depth, are observed on tasks like Montalbano.
For MM-IMDb, early fusion dominates (low-level fusion of posters/text), with deeper layers shifting slightly toward textual inputs.
AFEW prefers face-dominated early fusion, but central fusion gains prominence in the last layer.
Audioset exhibits negligible central influence in early conv layers, increasing toward the classifier output.

This demonstrates that CentralNet adapts the modality-combination schedule contingent on intrinsic task requirements (Vielzeuf et al., 2018).
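One simple way to read a fusion schedule off the learned weights is to compute, per layer, the share of total absolute weight assigned to the modality inputs versus the central self-connection. The function and the weight values below are purely illustrative assumptions, not measurements from the papers.

```python
def modality_share(alpha_modalities, alpha_central):
    """Per-layer fraction of absolute fusion weight given to modality inputs.

    alpha_modalities: one list of K modality weights per layer.
    alpha_central:    one central self-connection weight per layer.
    """
    shares = []
    for mods, central in zip(alpha_modalities, alpha_central):
        mod_mass = sum(abs(a) for a in mods)
        total = mod_mass + abs(central)
        shares.append(mod_mass / total if total > 0 else 0.0)
    return shares

# Made-up weights for a 3-layer, 2-modality model: modality inputs dominate
# at the shallow layer and fade with depth (an early-fusion-like schedule).
alpha_mods = [[0.9, 0.8], [0.4, 0.3], [0.1, 0.1]]
alpha_c = [0.0, 0.5, 1.0]
shares = modality_share(alpha_mods, alpha_c)
for depth, share in enumerate(shares):
    print(f"layer {depth}: modality share = {share:.2f}")
```

A profile that instead rises with depth would correspond to the late-fusion pattern, and a flat profile to the hybrid schedules mentioned for Montalbano.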

7. Significance and Comparative Context

CentralNet establishes a unified, flexible mechanism for multi-level fusion in multimodal deep neural networks. Its add-on central module does not mandate specific architectural restrictions on the unimodal branches, allowing plug-in use with recent advances in each modality domain. The empirical results indicate consistent, statistically significant improvements over state-of-the-art multimodal fusion techniques such as ModDrop and Gated Multimodal Units (GMU). The multi-objective regularizer is critical not only for maintaining unimodal performance but also for accelerating convergence during training. A plausible implication is that adaptive, learnable fusion depth can be leveraged broadly in multimodal or multi-sensor systems to automatically discover fusion schedules optimal for diverse domains (Vielzeuf et al., 2018, Vielzeuf et al., 2018).

References

CentralNet: a Multilayer Approach for Multimodal Fusion (2018)

Multi-Level Sensor Fusion with Deep Learning (2018)
