
SimBaMM: Simple Baseline for Multimodal Learning

Updated 4 January 2026
  • The paper introduces a closed-form probabilistic fusion method that computes optimal multimodal embeddings under conditional independence assumptions.
  • SimBaMM leverages a late-fusion Transformer architecture with per-modality encoders and token-based fusion to benchmark performance on large-scale multimodal tasks.
  • Empirical evaluations demonstrate competitive accuracy and efficiency across healthcare and sentiment datasets under rigorous, standardized experimental protocols.

Simple Baseline for Multimodal Learning (SimBaMM) designates two separate but thematically aligned contributions to multimodal machine learning: (1) a closed-form generative fusion scheme for multimodal utterance embeddings (Liang et al., 2019), and (2) a late-fusion Transformer architecture advanced as a benchmark in large-scale empirical studies (Rheude et al., 28 Dec 2025). Both approaches serve as strong baselines, emphasizing the effectiveness of simplicity and methodological rigor in comparison to increasingly complex multimodal models.

1. Precise Modeling Assumptions and Likelihood-Based Fusion

The probabilistic SimBaMM formulation (Liang et al., 2019) models each utterance $s$ by a unit-norm embedding $m_s \in \mathbb{R}^d$. The model assumes conditional independence of the modalities (words $w$, visual features $v$, and acoustic features $a$) given $m_s$, so the total likelihood factorizes as

$$P(s \mid m_s) = P(w \mid m_s)^{\alpha_w} \cdot P(v \mid m_s)^{\alpha_v} \cdot P(a \mid m_s)^{\alpha_a}$$

where each $\alpha_\cdot$ is a learned or preset modality weight. Each per-modality likelihood takes a simple form:

  • Language: For each word $w$, the likelihood employs Arora et al.-style smoothing,

$$P(w \mid m_s) = \alpha\, p(w) + (1-\alpha)\, \frac{\exp(\langle w, m_s \rangle)}{Z_{m_s}}$$

with $p(w)$ the corpus word frequency and $Z_{m_s}$ a normalization constant.
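As a minimal NumPy sketch of this smoothed likelihood, assuming an explicit matrix of word vectors is available to compute the partition function $Z_{m_s}$ (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def word_log_likelihood(w_vec, m_s, p_w, alpha, vocab):
    """Smoothed word log-likelihood, Arora et al.-style:
    P(w|m_s) = alpha * p(w) + (1 - alpha) * exp(<w, m_s>) / Z_{m_s}.
    `vocab` stacks all word vectors row-wise and is assumed here only to
    compute the partition function Z_{m_s}."""
    Z = np.exp(vocab @ m_s).sum()  # normalization over the vocabulary
    return np.log(alpha * p_w + (1 - alpha) * np.exp(w_vec @ m_s) / Z)
```

A word vector aligned with $m_s$ receives a higher score than an orthogonal one, as expected from the inner-product term.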

  • Visual & Acoustic: Each per-feature dimension is modeled by a diagonal Gaussian,

$$v(i) \mid m_s \sim \mathcal{N}\!\left(\mu_v(i), \sigma_v(i)^2\right)$$

with $\mu_v(i) = W_v^\mu(i)\, m_s + b_v^\mu(i)$ and $\sigma_v(i) = \exp\!\left(W_v^\sigma(i)\, m_s + b_v^\sigma(i)\right)$. Acoustic features are handled analogously.
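A minimal NumPy sketch of one such diagonal-Gaussian feature likelihood, with mean and log-std linear in $m_s$ as in the equations above (parameter names are illustrative):

```python
import numpy as np

def feature_log_likelihood(v, m_s, W_mu, b_mu, W_sigma, b_sigma):
    """Diagonal-Gaussian log-likelihood of a visual (or acoustic) feature
    vector v given the utterance embedding m_s."""
    mu = W_mu @ m_s + b_mu                       # per-dimension mean
    sigma = np.exp(W_sigma @ m_s + b_sigma)      # per-dimension std (positive)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - (v - mu) ** 2 / (2 * sigma ** 2)))
```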

No explicit bimodal or trimodal interaction terms are included in this formulation.

2. Closed-Form Fusion and Optimization

The main technical contribution is the analytic form of the optimal multimodal embedding under this likelihood model. Let $L(m_s)$ denote the total (approximated) log-likelihood,

$$L(m_s) = \sum_{w \in \mathbf{w}} f_w(m_s) + \sum_{v(i)} f_{v(i)}(m_s) + \sum_{a(i)} f_{a(i)}(m_s)$$

where the $f_*(m_s)$ are weighted log-likelihood terms. Each $f_*(m_s)$ admits a first-order Taylor expansion around $m_s = 0$, such that

$$L(m_s) \approx C + \langle g, m_s \rangle$$

with $g$ the sum of modality-weighted linear terms. Under the constraint $\|m_s\|_2 = 1$, the maximizer has the closed form

$$m_s^* = \frac{g}{\|g\|_2}$$

The weights $\psi_*$ (arising from word frequency, modality-specific parameters, etc.) are analytic functions, not trained by backpropagation but computed once from hyperparameters or fitted parameters.
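The closed-form maximizer reduces to a weighted vector sum followed by normalization. A minimal sketch, assuming the per-modality linear terms have already been computed (names are illustrative):

```python
import numpy as np

def fuse_embedding(linear_terms, weights):
    """Closed-form fusion: with L(m_s) ~ C + <g, m_s> and ||m_s||_2 = 1,
    the maximizer is m_s* = g / ||g||_2. `linear_terms` are per-modality
    linear coefficients; `weights` are the modality weights alpha_*."""
    g = sum(a * np.asarray(t) for a, t in zip(weights, linear_terms))
    return g / np.linalg.norm(g)
```

A single pass over the modality terms suffices; no iterative optimization is needed.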

3. Transformer-Based Simple Baseline Architecture

In parallel, SimBaMM is defined as a simple, late-fusion Transformer architecture (Rheude et al., 28 Dec 2025), structured in four stages:

  1. Per-modality Encoders: For each modality $m \in \{1, \ldots, M\}$, a pretrained encoder $E_m$ transforms inputs $x_{k,m}$ into features $e_{k,m} \in \mathbb{R}^{d_m}$. These features are projected into a unified space via trainable projections $P_m$:

$$\hat{e}_{k,m} = P_m(e_{k,m}) \in \mathbb{R}^d$$

  2. Tokenization / Fusion Preparation: Variants include (a) multiple tokens per modality (e.g., patch-wise or temporal), or (b) a single per-modality "[CLS]" token (SimBaMM{CLS}), with the sequence padded if modalities are missing.
  3. Late-Fusion Transformer Head: The token sequence $S_k \in \mathbb{R}^{L \times d}$ passes through $N$ Transformer encoder layers with full attention, masked for missing modalities. Sinusoidal positional encodings precede the stack. Multihead attention is computed as

$$Q = S W_\ell^Q, \quad K = S W_\ell^K, \quad V = S W_\ell^V; \quad \text{head}_h = \text{Softmax}\!\left(\frac{Q_h K_h^T}{\sqrt{d_h}} + \text{mask} \cdot (-\infty)\right) V_h$$

with standard post-attention residuals and feed-forward sublayers.

  4. Prediction Head: The first "[CLS]" token, or the mean of the Transformer output tokens $H$, yields the pooled representation $h_k$. Prediction is performed with a linear layer for classification or regression,

$$\ell_k = W_g h_k + b_g$$

followed by the appropriate activation and loss.
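The four stages can be sketched in PyTorch as a compact late-fusion head. This is a simplified illustration, not the paper's exact configuration: it uses a learned "[CLS]" token, one token per modality, and omits the sinusoidal positional encodings for brevity; all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SimBaMMHead(nn.Module):
    """Sketch of a late-fusion head: per-modality projections into a shared
    d-dim space, a small Transformer encoder with a key-padding mask for
    missing modalities, and a linear prediction head."""
    def __init__(self, modality_dims, d=64, n_layers=2, n_heads=4, n_classes=2):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dm, d) for dm in modality_dims)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))   # learned "[CLS]" token
        self.out = nn.Linear(d, n_classes)

    def forward(self, feats, missing_mask=None):
        # feats: list of (B, d_m) frozen-encoder outputs, one per modality
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        B = tokens.size(0)
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1)
        if missing_mask is not None:  # True marks a missing-modality token
            pad = torch.zeros(B, 1, dtype=torch.bool, device=tokens.device)
            missing_mask = torch.cat([pad, missing_mask], dim=1)
        h = self.encoder(tokens, src_key_padding_mask=missing_mask)
        return self.out(h[:, 0])      # predict from the "[CLS]" token
```

The key-padding mask implements the missing-modality masking in the attention computation: masked tokens receive no attention weight.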

4. Hyperparameter Selection and Implementation Protocol

All models are implemented in PyTorch Lightning; base encoders are frozen, training only the projection, Transformer head, and output head parameters (Rheude et al., 28 Dec 2025). Critical hyperparameters include:

  • Number of Transformer layers NN (2, 4, 6, 8)
  • Hidden dimension dd (32, 64, 128, 256, 512)
  • Number of heads HH (4, 8, 16)
  • Feed-forward size (256–2048)
  • Dropout values (0.0, 0.1, 0.2)
  • Learning rate (log-uniform over $[10^{-6}, 10^{-1}]$)
  • Weight decay and warmup steps

Training protocol: binary or categorical cross-entropy loss, subject-wise five-fold cross-validation with mean ± SD reporting, batch size 32–64, early stopping, and a rigorous Bayesian-optimized hyperparameter search (500 runs for the SimBaMM head).
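The search space above can be written down directly; the random sampler below is a stand-in sketch for the Bayesian optimization actually used, and the discrete feed-forward sizes are an assumption (the text gives only the 256–2048 range):

```python
import random

# Illustrative encoding of the hyperparameter ranges listed above.
SEARCH_SPACE = {
    "n_layers": [2, 4, 6, 8],
    "hidden_dim": [32, 64, 128, 256, 512],
    "n_heads": [4, 8, 16],
    "ffn_dim": [256, 512, 1024, 2048],   # assumption: powers of two in range
    "dropout": [0.0, 0.1, 0.2],
}

def sample_config(rng=random):
    """Draw one configuration; learning rate is log-uniform over [1e-6, 1e-1]."""
    cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    cfg["lr"] = 10 ** rng.uniform(-6, -1)
    return cfg
```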

5. Empirical Results and Comparative Evaluation

Empirical evaluation across nine real-world datasets covers domains such as healthcare (HAIM, Symile, INSPECT, UKB) and multimodal sentiment/emotion recognition (MOSI, MOSEI, CH-SIMS, CH-SIMS2, Crema-D) (Rheude et al., 28 Dec 2025). SimBaMM{CLS} and SimBaMM (multi-token) show performance metrics such as:

  • Healthcare AUROC: HAIM 0.6985, Symile 0.6318, INSPECT 0.6556, UKB 0.7957 (all mean ± SD).
  • Emotion accuracy: MOSI (7-class) 0.3229, MOSEI (7-class) 0.4936, CH-SIMS (5-class) 0.5086, CH-SIMS2 (5-class) 0.4351, Crema-D (6-class) 0.6723.

No complex architecture among 19 competing methods reliably outperforms SimBaMM under standardized hyperparameter search and protocol. Most models—including the strongest unimodal baselines—fall within the region of practical equivalence (±1% ROPE). For instance, in healthcare, a unimodal X-ray baseline matches or exceeds most multimodal results (HAIM: AUROC 0.7042). In small-data regimes (e.g., MOSI, CH-SIMS), architectural simplicity and careful tuning are sufficient for maximal performance.

Robustness to missing modalities is achieved via token masking in the Transformer, with SimBaMM matching the performance decline of explicit missing-modality strategies under up to 30% random missingness.
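A sketch of generating such a random-missingness mask for the robustness test (True marks a modality token the Transformer should ignore; the function name and interface are illustrative):

```python
import torch

def missingness_mask(present, rate, generator=None):
    """Randomly drop observed modalities with probability `rate`, on top of
    modalities already absent; returns a key-padding mask (True = ignore).
    `present` is a (B, M) boolean tensor of modalities actually observed."""
    drop = torch.rand(present.shape, generator=generator) < rate
    return ~present | drop
```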

6. Computational Efficiency and Implementation Considerations

The closed-form probabilistic SimBaMM (Liang et al., 2019) achieves inference in $O((|w|+|v|+|a|) \cdot d)$ time: single-pass matrix–vector computations with no backpropagation through sequence or attention mechanisms, and analytic per-feature weights. Training involves only linear regression and coordinate-wise nonlinearities for the parameters $W_*, b_*$.

Late-fusion SimBaMM (Rheude et al., 28 Dec 2025) is computationally efficient because it decouples the frozen base encoders from learning, which is limited to the shallow Transformer head and projections. No pretraining or data augmentation is applied, apart from specific cases. The minimal additional overhead relative to strong unimodal baselines underscores its utility as a robust benchmark for architectural comparisons and ablation studies.

7. Methodological Insights and Benchmarking Recommendations

A critical methodological finding is that standardized data splits, unified optimizer selection, and subject-independent cross-validation are essential for replicable benchmarking. SimBaMM’s strong results derive from these rigorous practices rather than from fusion-specific design (Rheude et al., 28 Dec 2025). A pragmatic reliability checklist is proposed:

  1. Use standardized experimental conditions (optimizer, initialization, splits, batch size)
  2. Ensure base encoder parity across models
  3. Always include simple late-fusion and unimodal baselines (SimBaMM)
  4. Equally tune hyperparameters for all comparison methods
  5. Employ subject/speaker-independent validation and report statistical uncertainty (mean ± SD)
  6. Test across diverse datasets for generalization
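Item 5 of the checklist (subject-independent validation with mean ± SD reporting) can be sketched with scikit-learn's GroupKFold; the data and the placeholder metric below are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = rng.integers(0, 2, size=20)
subjects = np.repeat(np.arange(5), 4)   # 5 subjects, 4 samples each

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    # subject-independence: no subject appears in both train and test folds
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
    scores.append(0.5)  # placeholder for a real evaluation metric
print(f"metric: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```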

A case study of a recent method (AUG, NeurIPS 2025) confirms that many previously reported gains vanish under rigorous, corrected evaluation. A plausible implication is that future progress in multimodal learning depends on methodological rigor and robust evaluation frameworks rather than incremental architectural complexity.

