
SimBaMM: Simple Baseline for Multimodal Learning

Updated 4 January 2026
  • The paper introduces a closed-form probabilistic fusion method that computes optimal multimodal embeddings under conditional independence assumptions.
  • SimBaMM leverages a late-fusion Transformer architecture with per-modality encoders and token-based fusion to benchmark performance on large-scale multimodal tasks.
  • Empirical evaluations demonstrate competitive accuracy and efficiency across healthcare and sentiment datasets under rigorous, standardized experimental protocols.

Simple Baseline for Multimodal Learning (SimBaMM) designates two separate but thematically aligned contributions to multimodal machine learning: (1) a closed-form generative fusion scheme for multimodal utterance embeddings (Liang et al., 2019), and (2) a late-fusion Transformer architecture advanced as a benchmark in large-scale empirical studies (Rheude et al., 28 Dec 2025). Both approaches serve as strong baselines, emphasizing the effectiveness of simplicity and methodological rigor in comparison to increasingly complex multimodal models.

1. Precise Modeling Assumptions and Likelihood-Based Fusion

The probabilistic SimBaMM formulation (Liang et al., 2019) models each utterance $s$ by a unit-norm embedding $m_s \in \mathbb{R}^d$. The model assumes conditional independence of the modalities (words $w$, visual features $v$, and acoustic features $a$) given $m_s$, so the total likelihood factorizes as

$$P(s \mid m_s) = P(w \mid m_s)^{\alpha_w} \cdot P(v \mid m_s)^{\alpha_v} \cdot P(a \mid m_s)^{\alpha_a}$$

where each $\alpha_\cdot$ is a learned or preset modality weight. Each per-modality likelihood takes a simple form:

  • Language: For each word $w$, the likelihood employs Arora et al.-style smoothing,

$$P(w \mid m_s) = \alpha\, p(w) + (1-\alpha)\, \frac{\exp(\langle w, m_s \rangle)}{Z_{m_s}}$$

with $p(w)$ the corpus word frequency and $Z_{m_s}$ a normalization constant.
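As a minimal NumPy sketch of this smoothed likelihood, assuming an explicit matrix of word vectors is available to compute the partition function $Z_{m_s}$ (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def word_log_likelihood(w_vec, m_s, p_w, alpha, vocab):
    """Smoothed word log-likelihood, Arora et al.-style:
    P(w|m_s) = alpha * p(w) + (1 - alpha) * exp(<w, m_s>) / Z_{m_s}.
    `vocab` stacks all word vectors row-wise and is assumed here only to
    compute the partition function Z_{m_s}."""
    Z = np.exp(vocab @ m_s).sum()  # normalization over the vocabulary
    return np.log(alpha * p_w + (1 - alpha) * np.exp(w_vec @ m_s) / Z)
```

A word vector aligned with $m_s$ receives a higher score than an orthogonal one, as expected from the inner-product term.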

  • Visual & Acoustic: Each per-feature dimension is modeled by a diagonal Gaussian,

$$v(i) \mid m_s \sim \mathcal{N}\!\left(\mu_v(i), \sigma_v(i)^2\right)$$

with $\mu_v(i) = W_v^\mu(i)\, m_s + b_v^\mu(i)$ and $\sigma_v(i) = \exp\!\left(W_v^\sigma(i)\, m_s + b_v^\sigma(i)\right)$. Acoustic features are handled analogously.
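A minimal NumPy sketch of one such diagonal-Gaussian feature likelihood, with mean and log-std linear in $m_s$ as in the equations above (parameter names are illustrative):

```python
import numpy as np

def feature_log_likelihood(v, m_s, W_mu, b_mu, W_sigma, b_sigma):
    """Diagonal-Gaussian log-likelihood of a visual (or acoustic) feature
    vector v given the utterance embedding m_s."""
    mu = W_mu @ m_s + b_mu                       # per-dimension mean
    sigma = np.exp(W_sigma @ m_s + b_sigma)      # per-dimension std (positive)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - (v - mu) ** 2 / (2 * sigma ** 2)))
```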

No explicit bimodal or trimodal interaction terms are included in this formulation.

2. Closed-Form Fusion and Optimization

The main technical contribution is the analytic form of the optimal multimodal embedding under this likelihood model. Let $L(m_s)$ denote the total (approximated) log-likelihood,

$$L(m_s) = \sum_{w \in \mathbf{w}} f_w(m_s) + \sum_{v(i)} f_{v(i)}(m_s) + \sum_{a(i)} f_{a(i)}(m_s)$$

where the $f_*(m_s)$ are weighted log-likelihood terms. Each $f_*(m_s)$ admits a first-order Taylor expansion around $m_s = 0$, such that

$$L(m_s) \approx C + \langle g, m_s \rangle$$

with $g$ the sum of modality-weighted linear terms. Under the constraint $\|m_s\|_2 = 1$, the maximizer has the closed form

$$m_s^* = \frac{g}{\|g\|_2}$$

The weights $\psi_*$ (arising from word frequency, modality-specific parameters, etc.) are analytic functions, not trained by backpropagation but computed once from hyperparameters or fitted parameters.
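The closed-form maximizer reduces to a weighted vector sum followed by normalization. A minimal sketch, assuming the per-modality linear terms have already been computed (names are illustrative):

```python
import numpy as np

def fuse_embedding(linear_terms, weights):
    """Closed-form fusion: with L(m_s) ~ C + <g, m_s> and ||m_s||_2 = 1,
    the maximizer is m_s* = g / ||g||_2. `linear_terms` are per-modality
    linear coefficients; `weights` are the modality weights alpha_*."""
    g = sum(a * np.asarray(t) for a, t in zip(weights, linear_terms))
    return g / np.linalg.norm(g)
```

A single pass over the modality terms suffices; no iterative optimization is needed.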

3. Transformer-Based Simple Baseline Architecture

In parallel, SimBaMM is defined as a simple, late-fusion Transformer architecture (Rheude et al., 28 Dec 2025), structured in four stages:

  1. Per-modality Encoders: For each modality $m \in \{1, \ldots, M\}$, a pretrained encoder $E_m$ transforms inputs $x_{k,m}$ into features $e_{k,m} \in \mathbb{R}^{d_m}$. These features are projected into a unified space via trainable projections $P_m$:

$$\hat{e}_{k,m} = P_m(e_{k,m}) \in \mathbb{R}^d$$

  2. Tokenization / Fusion Preparation: Variants include (a) multiple tokens per modality (e.g., patch-wise or temporal), or (b) a single per-modality "[CLS]" token (SimBaMM{CLS}), with the sequence padded if modalities are missing.
  3. Late-Fusion Transformer Head: The token sequence $S_k \in \mathbb{R}^{L \times d}$ passes through $N$ Transformer encoder layers with full attention, masked for missing modalities. Sinusoidal positional encodings precede the stack. Multihead attention is computed as

$$Q = S W_\ell^Q, \quad K = S W_\ell^K, \quad V = S W_\ell^V; \quad \text{head}_h = \text{Softmax}\!\left(\frac{Q_h K_h^T}{\sqrt{d_h}} + \text{mask} \cdot (-\infty)\right) V_h$$

with standard post-attention residuals and feed-forward sublayers.

  4. Prediction Head: The first "[CLS]" token, or the mean of the Transformer output tokens $H$, yields the pooled representation $h_k$. Prediction is performed with a linear layer for classification or regression,

$$\ell_k = W_g h_k + b_g$$

followed by the appropriate activation and loss.
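The four stages can be sketched in PyTorch as a compact late-fusion head. This is a simplified illustration, not the paper's exact configuration: it uses a learned "[CLS]" token, one token per modality, and omits the sinusoidal positional encodings for brevity; all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SimBaMMHead(nn.Module):
    """Sketch of a late-fusion head: per-modality projections into a shared
    d-dim space, a small Transformer encoder with a key-padding mask for
    missing modalities, and a linear prediction head."""
    def __init__(self, modality_dims, d=64, n_layers=2, n_heads=4, n_classes=2):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dm, d) for dm in modality_dims)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))   # learned "[CLS]" token
        self.out = nn.Linear(d, n_classes)

    def forward(self, feats, missing_mask=None):
        # feats: list of (B, d_m) frozen-encoder outputs, one per modality
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        B = tokens.size(0)
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1)
        if missing_mask is not None:  # True marks a missing-modality token
            pad = torch.zeros(B, 1, dtype=torch.bool, device=tokens.device)
            missing_mask = torch.cat([pad, missing_mask], dim=1)
        h = self.encoder(tokens, src_key_padding_mask=missing_mask)
        return self.out(h[:, 0])      # predict from the "[CLS]" token
```

The key-padding mask implements the missing-modality masking in the attention computation: masked tokens receive no attention weight.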

4. Hyperparameter Selection and Implementation Protocol

All models are implemented in PyTorch Lightning; base encoders are frozen, training only the projection, Transformer head, and output head parameters (Rheude et al., 28 Dec 2025). Critical hyperparameters include:

  • Number of Transformer layers NN (2, 4, 6, 8)
  • Hidden dimension dd (32, 64, 128, 256, 512)
  • Number of heads HH (4, 8, 16)
  • Feed-forward size (256–2048)
  • Dropout values (0.0, 0.1, 0.2)
  • Learning rate (log-uniform over $[10^{-6}, 10^{-1}]$)
  • Weight decay and warmup steps

Training protocol: binary or categorical cross-entropy loss, subject-wise five-fold cross-validation with mean ± SD reporting, batch size 32–64, early stopping, and a rigorous Bayesian-optimized hyperparameter search (500 runs for the SimBaMM head).
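The search space above can be written down directly; the random sampler below is a stand-in sketch for the Bayesian optimization actually used, and the discrete feed-forward sizes are an assumption (the text gives only the 256–2048 range):

```python
import random

# Illustrative encoding of the hyperparameter ranges listed above.
SEARCH_SPACE = {
    "n_layers": [2, 4, 6, 8],
    "hidden_dim": [32, 64, 128, 256, 512],
    "n_heads": [4, 8, 16],
    "ffn_dim": [256, 512, 1024, 2048],   # assumption: powers of two in range
    "dropout": [0.0, 0.1, 0.2],
}

def sample_config(rng=random):
    """Draw one configuration; learning rate is log-uniform over [1e-6, 1e-1]."""
    cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    cfg["lr"] = 10 ** rng.uniform(-6, -1)
    return cfg
```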

5. Empirical Results and Comparative Evaluation

Empirical evaluation across nine real-world datasets covers domains such as healthcare (HAIM, Symile, INSPECT, UKB) and multimodal sentiment/emotion recognition (MOSI, MOSEI, CH-SIMS, CH-SIMS2, Crema-D) (Rheude et al., 28 Dec 2025). SimBaMM{CLS} and SimBaMM (multi-token) show performance metrics such as:

  • Healthcare AUROC: HAIM 0.6985, Symile 0.6318, INSPECT 0.6556, UKB 0.7957 (all mean ± SD).
  • Emotion accuracy: MOSI (7-class) 0.3229, MOSEI (7-class) 0.4936, CH-SIMS (5-class) 0.5086, CH-SIMS2 (5-class) 0.4351, Crema-D (6-class) 0.6723.

No complex architecture among 19 competing methods reliably outperforms SimBaMM under standardized hyperparameter search and protocol. Most models—including the strongest unimodal baselines—fall within the region of practical equivalence (±1% ROPE). For instance, in healthcare, a unimodal X-ray baseline matches or exceeds most multimodal results (HAIM: AUROC 0.7042). In small-data regimes (e.g., MOSI, CH-SIMS), architectural simplicity and careful tuning are sufficient for maximal performance.

Robustness to missing modalities is achieved via token masking in the Transformer, with SimBaMM matching the performance decline of explicit missing-modality strategies under up to 30% random missingness.
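A sketch of generating such a random-missingness mask for the robustness test (True marks a modality token the Transformer should ignore; the function name and interface are illustrative):

```python
import torch

def missingness_mask(present, rate, generator=None):
    """Randomly drop observed modalities with probability `rate`, on top of
    modalities already absent; returns a key-padding mask (True = ignore).
    `present` is a (B, M) boolean tensor of modalities actually observed."""
    drop = torch.rand(present.shape, generator=generator) < rate
    return ~present | drop
```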

6. Computational Efficiency and Implementation Considerations

The closed-form probabilistic SimBaMM (Liang et al., 2019) achieves inference in $O((|w|+|v|+|a|) \cdot d)$ time: single-pass matrix–vector computations with no backpropagation through sequence or attention mechanisms, and analytic per-feature weights. Training involves only linear regression and coordinate-wise nonlinearities for the parameters $W_*, b_*$.

Late-fusion SimBaMM (Rheude et al., 28 Dec 2025) is computationally efficient because it decouples the frozen base encoders from learning, which is limited to the shallow Transformer head and projections. No pretraining or data augmentation is applied, apart from specific cases. The minimal additional overhead relative to strong unimodal baselines underscores its utility as a robust benchmark for architectural comparisons and ablation studies.

7. Methodological Insights and Benchmarking Recommendations

A critical methodological finding is that standardized data splits, unified optimizer selection, and subject-independent cross-validation are essential for replicable benchmarking. SimBaMM’s strong results derive from these rigorous practices rather than from fusion-specific design (Rheude et al., 28 Dec 2025). A pragmatic reliability checklist is proposed:

  1. Use standardized experimental conditions (optimizer, initialization, splits, batch size)
  2. Ensure base encoder parity across models
  3. Always include simple late-fusion and unimodal baselines (SimBaMM)
  4. Equally tune hyperparameters for all comparison methods
  5. Employ subject/speaker-independent validation and report statistical uncertainty (mean ± SD)
  6. Test across diverse datasets for generalization
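Item 5 of the checklist (subject-independent validation with mean ± SD reporting) can be sketched with scikit-learn's GroupKFold; the data and the placeholder metric below are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = rng.integers(0, 2, size=20)
subjects = np.repeat(np.arange(5), 4)   # 5 subjects, 4 samples each

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    # subject-independence: no subject appears in both train and test folds
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
    scores.append(0.5)  # placeholder for a real evaluation metric
print(f"metric: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```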

A case study of a recent method (AUG, NeurIPS 2025) confirms that many previously reported gains vanish under rigorous, corrected evaluation. A plausible implication is that future progress in multimodal learning depends on methodological rigor and robust evaluation frameworks rather than incremental architectural complexity.

