CMAC-MMD: Cross-Modal Alignment
- CMAC-MMD is a framework that enforces global semantic and distributional consistency across heterogeneous modality embeddings via a kernel-based Maximum Mean Discrepancy metric.
- It complements local alignment methods like Optimal Transport by integrating global MMD-based regularization to mitigate distributional gaps in multimodal architectures.
- Empirical studies reveal that incorporating CMAC-MMD yields measurable performance improvements and reduced modality interference in models such as AlignMamba and DecAlign.
Cross-Modal Alignment Consistency with Maximum Mean Discrepancy (CMAC-MMD) is a framework for enforcing semantic and distributional agreement between feature representations of heterogeneous modalities in multimodal machine learning architectures. By leveraging kernel-based distance metrics, specifically Maximum Mean Discrepancy (MMD), CMAC-MMD aligns modality-specific embeddings within a shared latent space, enabling robust representation fusion and mitigating distributional gaps. CMAC-MMD provides global alignment complementary to local, token-level matching methods such as Optimal Transport (OT), and constitutes a foundational component in recent hierarchical and efficient multimodal models, including AlignMamba (Li et al., 1 Dec 2024) and DecAlign (Qian et al., 14 Mar 2025).
1. Theoretical Foundations and Motivation
The heterogeneity of sensory modalities—such as vision, audio, and language—induces intrinsic differences in the statistical structure and geometry of their respective feature spaces. While local alignment techniques (e.g., OT) address temporal and fine-grained correspondences, they do not guarantee global, distribution-level coherence. Cross-Modal Alignment Consistency (CMAC) seeks to enforce that embeddings representing the same semantics, regardless of modality, share similar distributional properties.
MMD is central to CMAC-MMD: it is a non-parametric, kernel-based two-sample test that measures the discrepancy between two empirical distributions. A feature map $\phi$ projects input vectors into a reproducing-kernel Hilbert space (RKHS) $\mathcal{H}$, where the squared MMD between distributions $P$ and $Q$ is computed as

$$\mathrm{MMD}^2(P, Q) = \left\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \right\|_{\mathcal{H}}^2,$$

which can be expanded as a sum over kernel evaluations:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')].$$

The kernel $k$ is typically Gaussian (RBF), enabling implicit comparison of infinite-dimensional moments without density estimation (Li et al., 1 Dec 2024, Qian et al., 14 Mar 2025).
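As a concrete reference, a minimal sketch of this estimator with an RBF kernel is shown below in PyTorch; the function names and tensor shapes are illustrative assumptions, not code from either paper.

```python
import torch

def rbf_kernel(x, y, sigma_sq):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), evaluated for all sample pairs
    d2 = torch.cdist(x, y).pow(2)             # (n, m) squared Euclidean distances
    return torch.exp(-d2 / (2.0 * sigma_sq))

def mmd2(x, y, sigma_sq=1.0):
    """Biased (plug-in) estimate of squared MMD between x: (n, d) and y: (m, d)."""
    return (rbf_kernel(x, x, sigma_sq).mean()
            + rbf_kernel(y, y, sigma_sq).mean()
            - 2.0 * rbf_kernel(x, y, sigma_sq).mean())
```

This is the biased V-statistic form; the unbiased U-statistic variant used in practice is sketched in the implementation section below.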
2. Formulation and Integration into Modern Architectures
CMAC-MMD is implemented after local alignment stages via OT or similar modules. Consider two or more modalities, e.g., audio $X_a$, video $X_v$, and text $X_t$. After OT-based alignment, each non-anchor modality is temporally synchronized to the anchor (typically text), yielding $\hat{X}_a$ and $\hat{X}_v$. The global alignment loss is then formulated as a sum of MMDs between the aligned modalities and the anchor:

$$\mathcal{L}_{\mathrm{MMD}} = \mathrm{MMD}^2(\hat{X}_a, X_t) + \mathrm{MMD}^2(\hat{X}_v, X_t).$$
In the case of $M$ modalities and a decoupled architecture as in DecAlign, CMAC-MMD regularization is applied across all pairs of modality-common embedding sets $\{Z_i^{c}\}_{i=1}^{M}$:

$$\mathcal{L}_{\mathrm{MMD}} = \sum_{1 \le i < j \le M} \mathrm{MMD}^2(Z_i^{c}, Z_j^{c})$$

(Li et al., 1 Dec 2024, Qian et al., 14 Mar 2025).
The overall loss function combines a downstream task loss (e.g., cross-entropy or MSE) with the CMAC-MMD term:

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\,\mathcal{L}_{\mathrm{MMD}},$$

with $\lambda$ a hyperparameter governing alignment strength.
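Under these definitions, the full objective can be assembled as follows, reusing the `mmd2` sketch above; the anchor-based sum corresponds to the AlignMamba-style loss and the pairwise sum to the DecAlign-style loss (variable names are illustrative):

```python
def anchor_alignment_loss(aligned, anchor):
    # Sum of MMDs between each OT-aligned modality and the anchor (typically text)
    return sum(mmd2(z_hat, anchor) for z_hat in aligned)

def pairwise_alignment_loss(common):
    # MMD over all pairs of modality-common embedding sets (M modalities)
    loss = 0.0
    for i in range(len(common)):
        for j in range(i + 1, len(common)):
            loss = loss + mmd2(common[i], common[j])
    return loss

# Total objective: downstream task loss plus lambda-weighted alignment term, e.g.
# loss = task_loss + lam * anchor_alignment_loss([z_a_hat, z_v_hat], z_t)
```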
3. Computational and Implementation Aspects
MMD computation is $O(n^2)$ per modality pair, which is feasible for batch or token counts $n$ up to several hundred. Beyond this, linear-time kernel approximations (e.g., random Fourier features) or token subsampling are used. The median heuristic is effective for kernel bandwidth selection: $\sigma^2$ is set to the median squared pairwise distance within a held-out batch. Unbiased U-statistic estimators of MMD are employed to mitigate bias, with an empirically determined $\lambda$ yielding stable trade-offs between the task and alignment objectives.
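A minimal sketch of the median heuristic and the unbiased U-statistic estimator, built on the `rbf_kernel` helper above (helper names are assumptions):

```python
def median_heuristic(x, y):
    # sigma^2 = median squared pairwise distance over the pooled batch
    z = torch.cat([x, y], dim=0)
    d2 = torch.cdist(z, z).pow(2)
    off_diag = ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    return d2[off_diag].median().detach()   # detach: no gradient through bandwidth

def mmd2_unbiased(x, y, sigma_sq):
    """Unbiased U-statistic estimate: self-similarity (diagonal) terms excluded."""
    n, m = x.size(0), y.size(0)
    k_xx = rbf_kernel(x, x, sigma_sq)
    k_yy = rbf_kernel(y, y, sigma_sq)
    k_xy = rbf_kernel(x, y, sigma_sq)
    term_xx = (k_xx.sum() - k_xx.diagonal().sum()) / (n * (n - 1))
    term_yy = (k_yy.sum() - k_yy.diagonal().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()
```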
Stability is further enhanced by gradient clipping, mild weight decay on the alignment parameters, and decoupled learning rates for the backbone and alignment modules (Li et al., 1 Dec 2024). CMAC-MMD terms are evaluated per training step on each mini-batch and can be extended to any number of modalities or alternative kernels (e.g., mixtures of RBFs).
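These stability measures might look like the following in PyTorch; the learning rates, weight-decay coefficient, clipping threshold, and module names are placeholder assumptions rather than values reported in the papers:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(128, 64)       # stand-in for the fusion backbone
align_head = nn.Linear(64, 64)      # stand-in for the alignment module

# Decoupled learning rates, with mild weight decay on the alignment parameters
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": align_head.parameters(), "lr": 1e-4, "weight_decay": 1e-4},
])

x = torch.randn(32, 128)
loss = align_head(backbone(x)).pow(2).mean()   # dummy loss for illustration
loss.backward()

# Clip gradients before the optimizer step (max_norm is a placeholder)
torch.nn.utils.clip_grad_norm_(
    [p for group in optimizer.param_groups for p in group["params"]], max_norm=1.0
)
optimizer.step()
```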
4. Empirical Evaluation and Ablation Studies
Ablation studies in AlignMamba demonstrate that removing the CMAC-MMD term leads to measurable performance degradation. For instance, on CMU-MOSI, the full model achieves 86.9% accuracy, which drops to 85.8% without the global MMD loss (–1.1 points). The effect is consistent on CMU-MOSEI (full: 86.6%, w/o MMD: 85.7%, –0.9 points). Omitting local OT yields even larger drops. Importantly, qualitative feature visualization (t-SNE) shows compact clustering of aligned features after CMAC-MMD, and the A-distance between modalities is reduced (e.g., 1.57 → 1.49 for video-text on MOSI), indicating better distributional alignment (Li et al., 1 Dec 2024).
Similarly, DecAlign applies MMD on modality-common features, observing small but consistent improvements: for example, MAE drops from 0.741 to 0.738 and F1 rises from 84.61 to 84.73 on CMU-MOSI when MMD is included, confirming alignment regularization enhances semantic consistency (Qian et al., 14 Mar 2025).
| Model/Setting | Metric | Full (with MMD) | w/o MMD | Δ (points) |
|---|---|---|---|---|
| AlignMamba (CMU-MOSI) | Accuracy | 86.9 | 85.8 | –1.1 |
| AlignMamba (CMU-MOSEI) | Accuracy | 86.6 | 85.7 | –0.9 |
| DecAlign (CMU-MOSI) | F1 | 84.73 | 84.61 | –0.12 |
| DecAlign (CMU-MOSEI) | F1 | 85.33 | 85.26 | –0.07 |
These results indicate consistent, architecture-agnostic gains from CMAC-MMD.
5. Hierarchical and Decoupled Cross-Modal Alignment
In hierarchical frameworks such as DecAlign, CMAC-MMD operates as the upper tier of cross-modal alignment. The architecture first decouples modality-unique and modality-common branches. Local heterogeneity alignment is executed via Gaussian-mixture prototype-guided multi-marginal OT, ensuring preservation of fine-grained, modality-dependent characteristics. The global, semantic alignment then applies MMD-based regularization to the modality-common branches, compelling their latent representations to share identical high-order statistics and subspace geometry. This multilevel process stabilizes multimodal fusion, minimizing semantic drift and modality interference (Qian et al., 14 Mar 2025).
A plausible implication is that hierarchical application of CMAC-MMD can be extended seamlessly to scenarios with arbitrary modality count, as the pairwise MMD objective generalizes to multiple branches.
6. Practical Implementation Guidance
For reproducibility, a CMAC-MMD implementation requires only (i) local alignment modules producing temporally synchronized embeddings, (ii) batchwise MMD computation over the aligned feature sets, and (iii) proper tuning of the kernel bandwidth and $\lambda$. MMD is straightforward to implement with standard kernel operations in modern ML libraries, and for long-sequence tasks, linear-time estimators or feature subsampling suffice to retain scalability.
Practical workflow recommendations include the following (a minimal training-step sketch follows the list):
- Apply MMD post-local alignment on each training mini-batch.
- Tune the RBF kernel bandwidth using the median heuristic on a randomly sampled batch.
- Choose $\lambda$ through grid search over a small set of candidate values.
- For large batch or token counts, subsample tokens or employ random Fourier feature approximations.
- Jointly monitor both task and alignment losses to maintain training balance (Li et al., 1 Dec 2024).
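Tying these recommendations together, a per-step sketch might look as follows; the model's output signature, the batch keys, and the value of $\lambda$ are assumptions, and the helpers are those sketched in earlier sections:

```python
import torch.nn.functional as F

def training_step(batch, model, lam=0.1):
    # Hypothetical model API: task logits plus OT-aligned modality embeddings
    logits, z_audio, z_video, z_text = model(batch)

    task_loss = F.cross_entropy(logits, batch["labels"])

    # Median-heuristic bandwidth, then MMD against the text anchor per mini-batch
    sigma_sq = median_heuristic(z_audio, z_text)
    align_loss = (mmd2_unbiased(z_audio, z_text, sigma_sq)
                  + mmd2_unbiased(z_video, z_text, sigma_sq))

    # Log both terms so the task/alignment balance can be monitored
    logs = {"task": task_loss.item(), "align": align_loss.item()}
    return task_loss + lam * align_loss, logs
```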
By following these guidelines and referencing open-source implementations (e.g., DecAlign), CMAC-MMD can be ported to new fusion backbones or modalities with minimal adaptation.
7. Significance and Broader Impact
The incorporation of CMAC-MMD has demonstrated consistent quantitative and qualitative improvements in multimodal learning for affect recognition, sentiment analysis, and broader human-computer interaction tasks. By enforcing distributional alignment of modality-common representations and supplementing local, correspondence-aware alignment strategies, CMAC-MMD increases model robustness, reduces spurious modality-specific influence, and improves semantic transfer. Its kernel-based formulation provides flexibility for integration with future architectures and for handling arbitrary modality sets, underpinning scalable, uncertainty-resilient multimodal learning (Li et al., 1 Dec 2024, Qian et al., 14 Mar 2025).