Class-Former-Aided Modality Alignment (CMA)
- Class-Former-Aided Modality Alignment is a mechanism that aligns heterogeneous modality features by summarizing per-modality features into class-level descriptors and enforcing semantic consistency among them.
- It employs trainable class-level queries and multihead cross-attention to aggregate and refine features, ensuring robust fusion even with missing modality data.
- Empirical results demonstrate that CMA improves accuracy and segmentation performance, making multimodal transformers more resilient under diverse conditions.
Class-Former-Aided Modality Alignment (CMA) refers to a class of regularization and alignment mechanisms designed for multimodal transformer architectures, facilitating semantic consistency and robustness when fusing heterogeneous modalities—particularly in scenarios with missing or partial modality data. CMA operates by summarizing per-modality feature sets and explicitly aligning modality-specific class descriptors in a unified semantic space, either during training (as in AMBER (Wen et al., 12 Dec 2025)) or as a refinement stage for query representations (as in BiXFormer (Chen et al., 4 Jun 2025)).
1. Core Principles and Architectural Role
CMA addresses challenges inherent to multimodal learning, where different modalities (e.g., image, LiDAR, radar, GPS) produce features residing in heterogeneous latent spaces. Without deliberate alignment, fusion mechanisms can collapse or shortcut semantic representations, reducing cross-modal robustness and performance, especially under missing-modality conditions. CMA introduces trainable class-level queries for each modality, applies multihead cross-attention so that these queries aggregate modality features into condensed class descriptors, and enforces semantic alignment via contrastive or adversarial losses. In AMBER, CMA functions as a training-time regularizer, processing outputs after the modality-specific and global fusion blocks; in BiXFormer, CMA refines weak mask-classification queries (from Complementary Matching, CM) by aligning them to optimally-matched queries (from Modality-Agnostic Matching, MAM).
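As a concrete illustration, below is a minimal PyTorch sketch of this class-query aggregation step, assuming standard multihead cross-attention; the module and its names (`ClassQueryAggregator`) are illustrative, not taken from either paper.

```python
# Minimal PyTorch sketch of class-query cross-attention aggregation; names and
# the residual/LayerNorm arrangement are illustrative assumptions.
import torch
import torch.nn as nn

class ClassQueryAggregator(nn.Module):
    """Condenses one modality's token features into per-class descriptors."""
    def __init__(self, num_classes: int, dim: int, heads: int = 8):
        super().__init__()
        # Trainable class-level queries (one instance per modality / fusion branch).
        self.queries = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) tokens from one modality's encoder.
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)  # (B, C, dim)
        out, _ = self.attn(q, feats, feats)   # queries attend over modality tokens
        return self.norm(out + q)             # (B, C, dim) condensed class descriptors
```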
2. Mathematical Formulation and Operational Workflow
In AMBER, let $F_m$ denote the refined features for modality $m$, and $F_g$ the global fused features. CMA maintains a trainable class-query set $q_m$ for each modality and $q_g$ for the fusion branch, all initialized randomly. These are updated via multihead cross-attention:

$$\hat{q}_m = \mathrm{MHCA}(q_m, F_m, F_m), \qquad \hat{q}_g = \mathrm{MHCA}(q_g, F_g, F_g),$$

where $\mathrm{MHCA}(Q, K, V)$ denotes multihead cross-attention with queries $Q$ attending over keys/values $K, V$.
Cross-modal semantic alignment is enforced by a supervised contrastive loss:

$$\mathcal{L}_{\mathrm{CMA}} = -\frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \sum_{c=1}^{C} \log \frac{\exp\big(\mathrm{sim}(\hat{q}_m^{c}, \hat{q}_g^{c})/\tau\big)}{\sum_{c'=1}^{C} \exp\big(\mathrm{sim}(\hat{q}_m^{c}, \hat{q}_g^{c'})/\tau\big)},$$

where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, and $\mathcal{M}$ is the set of available modalities.
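A hedged sketch of this loss follows, treating the fused descriptors as anchors and the same-class fused descriptor as the positive for each modality descriptor; the temperature value and batching interface are assumptions.

```python
# Sketch of the supervised contrastive alignment loss defined above; positives
# and negatives follow the equation (same-class fused descriptor vs. other classes).
import torch
import torch.nn.functional as F

def cma_contrastive_loss(mod_desc: torch.Tensor, fused_desc: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
    # mod_desc: (M, C, D) descriptors of the M available modalities;
    # fused_desc: (C, D) descriptors from the global fusion branch.
    mod = F.normalize(mod_desc, dim=-1)       # cosine similarity via normalization
    fused = F.normalize(fused_desc, dim=-1)
    logits = torch.einsum('mcd,ed->mce', mod, fused) / tau   # (M, C, C) similarities
    targets = torch.arange(logits.size(1), device=logits.device)
    # Positive for class c is the fused descriptor of the same class c.
    return F.cross_entropy(logits.flatten(0, 1), targets.repeat(logits.size(0)))
```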
In BiXFormer, CMA processes the outputs of UMM. It encodes the strong queries $Q_s^{a}/Q_s^{b}$ of the two modality streams via a shared encoder to obtain latent codes $z^{a}/z^{b}$, and refines the weak queries $Q_w^{a}/Q_w^{b}$ via a decoder conditioned on the cross-modal latent:

$$\tilde{Q}_w^{a} = \mathrm{Dec}\big(Q_w^{a}, z^{b}\big), \qquad \tilde{Q}_w^{b} = \mathrm{Dec}\big(Q_w^{b}, z^{a}\big).$$

The alignment loss is defined as:

$$\mathcal{L}_{\mathrm{align}} = \big\|\tilde{Q}_w - Q_s\big\|_2^{2} \;-\; \lambda\, \mathrm{MMD}\big(z^{a}, z^{b}\big).$$

This enforces both intra-class compactness (via the MSE term) and inter-modality separation (via the MMD term).
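The following sketch shows one way such a refiner could be wired, assuming a plain MLP encoder/decoder in place of the paper's VAE-based refiner and an RBF-kernel MMD estimate; all names (`QueryRefiner`, `rbf_mmd`, `sigma`) are hypothetical.

```python
# Illustrative sketch of a BiXFormer-style query refiner under the assumptions above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Biased estimate of squared MMD with an RBF kernel; x: (N, D), y: (M, D).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

class QueryRefiner(nn.Module):
    def __init__(self, dim: int, latent: int):
        super().__init__()
        # Shared encoder: strong queries -> latent codes.
        self.enc = nn.Sequential(nn.Linear(dim, latent), nn.ReLU(),
                                 nn.Linear(latent, latent))
        # Decoder: weak query + cross-modal latent -> refined query.
        self.dec = nn.Sequential(nn.Linear(dim + latent, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, weak_q: torch.Tensor, strong_q_other: torch.Tensor):
        z_other = self.enc(strong_q_other)                        # (C, latent)
        refined = self.dec(torch.cat([weak_q, z_other], dim=-1))  # (C, dim)
        return refined, z_other

# Alignment loss, mirroring the equation above (MSE compactness minus MMD separation):
#   loss = F.mse_loss(refined_a, strong_a) + F.mse_loss(refined_b, strong_b) \
#          - lam * rbf_mmd(z_a, z_b)
```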
3. Handling Missing Modalities
CMA explicitly incorporates missing-modality awareness. In AMBER, modality-specific blocks include a binary availability mask $a_m \in \{0, 1\}$; if $a_m = 0$, the corresponding features and descriptors are skipped during loss calculation, preventing noise propagation and spurious cross-modal alignment. The contrastive loss sums only over the set of available modalities $\mathcal{M}$, ensuring robustness when modalities are absent. A similar principle applies in BiXFormer, where the refinement mechanism elevates the weaker queries of missing or suboptimal modalities by leveraging strong semantic centers learned from the available streams.
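A short sketch of this masking logic, reusing `cma_contrastive_loss` from the snippet above; the list-based interface and the zero fallback when no modality is available are assumptions.

```python
# Sketch of availability masking: unavailable modalities contribute neither
# descriptors nor loss terms.
import torch

def masked_cma_loss(descriptors, fused_desc, availability, loss_fn):
    # descriptors: list of (C, D) per-modality class descriptors
    # availability: per-modality binary flags a_m; a_m == 0 modalities are skipped
    avail = [d for d, a in zip(descriptors, availability) if a == 1]
    if not avail:
        return torch.zeros((), device=fused_desc.device)  # no term when M is empty
    return loss_fn(torch.stack(avail), fused_desc)        # sum only over M
```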
4. Temporal Coherence and Positional Embedding
AMBER’s variant of CMA incorporates temporally-aware positional embeddings. Prior to transformer processing, each token (spatial or fusion) is augmented with spatial and temporal sine/cosine embeddings of the standard form:

$$\mathrm{PE}(p, 2i) = \sin\big(p / 10000^{2i/d}\big), \qquad \mathrm{PE}(p, 2i+1) = \cos\big(p / 10000^{2i/d}\big),$$

evaluated at both the spatial index and the temporal index of each token and added to its features.
Consequently, CMA class tokens aggregate temporally indexed semantics via cross-attention over transformer outputs, yielding temporally-consistent and modality-invariant representations across input windows.
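A minimal sketch of such spatio-temporal sine/cosine embeddings follows; splitting the channel budget evenly between spatial and temporal indices is an assumption, not a detail from the paper, and `dim` is taken to be divisible by 4.

```python
# Minimal sketch of spatio-temporal sinusoidal embeddings under the assumptions above.
import torch

def sincos_embed(pos: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal embedding for integer positions: (N,) -> (N, dim).
    i = torch.arange(dim // 2, dtype=torch.float32, device=pos.device)
    freqs = pos[:, None].float() / (10000.0 ** (2 * i / dim))[None, :]
    return torch.cat([freqs.sin(), freqs.cos()], dim=-1)

def spatiotemporal_pe(num_tokens: int, num_steps: int, dim: int) -> torch.Tensor:
    # One embedding per (time step, spatial token); halves are concatenated.
    s = sincos_embed(torch.arange(num_tokens), dim // 2)   # (N, dim/2) spatial part
    t = sincos_embed(torch.arange(num_steps), dim // 2)    # (T, dim/2) temporal part
    pe = torch.cat([s[None].expand(num_steps, -1, -1),
                    t[:, None].expand(-1, num_tokens, -1)], dim=-1)
    return pe                                              # (T, N, dim), added to tokens
```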
5. Training Objectives and Inference Regimes
The total AMBER training objective combines the beam-prediction focal loss ($\mathcal{L}_{\mathrm{focal}}$), the CMA contrastive alignment loss ($\mathcal{L}_{\mathrm{CMA}}$), and a modality reweighting regularizer ($\mathcal{L}_{\mathrm{rw}}$):

$$\mathcal{L} = \mathcal{L}_{\mathrm{focal}} + \lambda_{\mathrm{CMA}}\, \mathcal{L}_{\mathrm{CMA}} + \lambda_{\mathrm{rw}}\, \mathcal{L}_{\mathrm{rw}}.$$
In BiXFormer, the segmentation loss ($\mathcal{L}_{\mathrm{seg}}$) is augmented with CMA’s alignment loss weighted by a scalar $\lambda$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda\, \mathcal{L}_{\mathrm{align}}.$$
During inference, AMBER disables the CMA module, using only the learned fusion mechanism and the final beam-prediction head. BiXFormer replaces the original weak queries with their CMA-refined counterparts prior to classification and mask prediction.
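To make the composite objective concrete, here is an illustrative training-step sketch; the focal-loss form (gamma = 2, no class weighting) and the default weights `lam_cma`, `lam_rw` are assumptions, not values from the paper.

```python
# Illustrative composite objective for AMBER-style training under the assumptions above.
import torch
import torch.nn.functional as F

def amber_objective(beam_logits, targets, loss_cma, loss_rw,
                    lam_cma: float = 0.1, lam_rw: float = 0.01) -> torch.Tensor:
    ce = F.cross_entropy(beam_logits, targets, reduction='none')
    pt = torch.exp(-ce)                      # probability of the true beam
    focal = ((1.0 - pt) ** 2 * ce).mean()    # focal loss, gamma = 2 (assumed)
    return focal + lam_cma * loss_cma + lam_rw * loss_rw
```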
6. Empirical Performance and Impact
Ablation studies in AMBER demonstrate that adding CMA to positional embeddings and reweight indicators yields measurable improvements: for full-modality input, Top-3 accuracy increases by 0.59% (88.48% → 89.07%), and for the missing-two-modalities setting, Top-3 rises by 0.33% (86.98% → 87.31%). Under severe missing-modality regimes (e.g., up to 75% random masking of one modality), CMA-equipped AMBER maintains Top-1 accuracy at 62.3%, outperforming LSTM baselines by ~4.6%. When up to three modalities are missing (each at a 50% rate), CMA limits the Top-1 accuracy degradation to only ~3.2%, compared to 13.5% for a multimodal LSTM (Wen et al., 12 Dec 2025).
In BiXFormer, adding CMA to the two-backbone UMM pipeline (T+U+A) increases mean mIoU from 54.47% to 55.39%. Refiner variants reveal that CMA with a VAE-based refiner achieves the best trade-off between modality distance and class distance (modality 1.03, class 0.49), yielding the highest mIoU (55.39%) and outperforming naïve alignment and MLP-based alternatives. CMA consistently strengthens the weaker CM queries, steering them toward the semantic centers uncovered by MAM and providing robust multimodal segmentation, especially where modalities are missing or degraded (Chen et al., 4 Jun 2025).
7. Synthesis and Significance
Class-Former-Aided Modality Alignment provides a principled mechanism for cross-modal semantic regularization in multimodal transformer systems, yielding modality-invariant representations while preserving modality-unique detail. Its application mitigates modality shortcutting and latent space collapse, substantially improving both peak accuracy and robustness under arbitrary missing modalities. The module entails marginal computational overhead, is deployable as a training-time regularizer or as an inference-time query refiner, and integrates seamlessly with modern transformer-based fusion architectures. A plausible implication is that future multimodal benchmarks will increasingly adopt similar alignment strategies, particularly for applications requiring adaptive fusion and robust prediction with incomplete data.