Learnable Modality Tokens
- Learnable modality tokens are trainable embeddings that encode modality-specific signals, enabling precise multimodal fusion and improved cross-modal alignment.
- They are integrated at strategic points in transformer architectures, using self-attention and cross-attention to efficiently aggregate information from diverse modalities.
- These tokens drive robust performance in applications like radiology, autonomous driving, and object re-identification, even under missing or imbalanced modalities.
Learnable modality tokens are trainable representations that directly encode modality-specific information (such as vision, text, audio, speech, tabular, or LiDAR) and are integrated into transformer-based architectures to enhance multimodal fusion, cross-modal alignment, and robustness to missing or imbalanced modalities. These tokens may be appended, injected, or adaptively generated within the model and can be supervised, adapted on-the-fly, or learned end-to-end, facilitating more precise and efficient multimodal learning in domains such as natural language, vision, robotics, speech, and medical analysis.
1. Architectural Placement and Formulation
Learnable modality tokens are introduced at strategic locations in transformer architectures, typically as appended tokens within input sequences or separate query blocks. In BREEN (Li et al., 16 Mar 2025), learnable query tokens are placed between image patch embeddings and text tokens, acting as a semantic bridge supervised by CLIP-derived visual features. In DeepMLF (Georgiou et al., 15 Apr 2025), fusion tokens are appended to LLM inputs and undergo repeated cross-attention with audiovisual features. MoMa (Lin et al., 31 Jul 2024) partitions experts into modality-specific groups, with routing determined by a gating function on the input token’s semantic content.
Mathematically, modality tokens are often formulated as learnable embeddings or vectors:
- For METransformer (Wang et al., 2023), expert tokens are concatenated with image tokens and passed through multi-head self-attention and bilinear pooling.
- In MMoT (Zheng et al., 2023), adaptive fusion is performed by a mixer module in which a learnable [PULSE] modality token modulates the contribution of each conditioning modality.
- For modality-missing scenarios (Gu et al., 22 Sep 2025), dedicated learnable tokens (one for the image modality, one for the tabular modality) replace zero matrices to represent missing modalities.
The learnability and placement of these tokens dictate their ability to encode discrete or continuous modality signals, aggregate cross-modal semantics, and adapt fusion strategies dynamically.
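To make the placement concrete, here is a minimal PyTorch-style sketch of appending learnable query tokens between image patch embeddings and text embeddings, in the spirit of the BREEN-style placement described above; the module name, shapes, and initialization are illustrative assumptions rather than any paper's exact implementation.

```python
import torch
import torch.nn as nn

class ModalityTokenConcat(nn.Module):
    """Sketch: insert learnable query/modality tokens between image patch
    embeddings and text embeddings before a transformer backbone."""
    def __init__(self, num_queries: int, dim: int):
        super().__init__()
        # Learnable modality/query tokens, trained end-to-end with the model.
        self.query_tokens = nn.Parameter(torch.randn(num_queries, dim) * 0.02)

    def forward(self, image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (B, N_img, D), text_embeds: (B, N_txt, D)
        batch = image_embeds.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Queries sit between the image patches and the text tokens,
        # acting as a semantic bridge for downstream attention layers.
        return torch.cat([image_embeds, queries, text_embeds], dim=1)
```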
2. Interaction Mechanisms and Information Exchange
Modality tokens interface with other tokens (patch, word, instruction, or expert tokens) via various transformer mechanisms, including self-attention, cross-attention, bilinear attention, and specialized routing.
- METransformer (Wang et al., 2023) allows expert tokens to attend to image regions and each other in a ViT encoder, with orthogonal loss promoting diversity and complementarity.
- MMoT (Zheng et al., 2023) leverages modality tokens for region-wise fusion, where adaptive weighting supports differentiated control signal strength per spatial location.
- In DeepMLF (Georgiou et al., 15 Apr 2025), fusion tokens accumulate linguistic signals via causal self-attention in LM blocks and cross-modal information through gated cross-attention in MM blocks, iteratively refining multimodal representations with layer depth.
- MoMa (Lin et al., 31 Jul 2024) achieves modality-aware and instance-adaptive computation by routing each input token first to a modality-specific expert group and then to experts within that group via a learned gating function computed from the token's hidden representation (a routing sketch follows this list).
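The following is a minimal sketch, assuming a simplified dense (non-sparse) version of modality-aware routing: tokens are assigned to a modality group, then softly weighted across that group's experts by a learned gate. Class and argument names are assumptions, not MoMa's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    """Sketch: modality-grouped experts with a learned per-group gate."""
    def __init__(self, dim: int, experts_per_group: int, num_modalities: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.ModuleList([nn.Linear(dim, dim) for _ in range(experts_per_group)])
            for _ in range(num_modalities)
        ])
        self.gates = nn.ModuleList([
            nn.Linear(dim, experts_per_group) for _ in range(num_modalities)
        ])

    def forward(self, x: torch.Tensor, modality_id: int) -> torch.Tensor:
        # x: (B, N, D); modality_id selects the expert group for these tokens.
        gate = F.softmax(self.gates[modality_id](x), dim=-1)                    # (B, N, E)
        expert_out = torch.stack(
            [expert(x) for expert in self.experts[modality_id]], dim=-2)        # (B, N, E, D)
        # Gate-weighted combination of the group's expert outputs.
        return (gate.unsqueeze(-1) * expert_out).sum(dim=-2)                    # (B, N, D)
```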
This orchestrated information exchange enables tokens to capture multimodal dependencies while preserving task-specific or modality-specific granularity.
3. Training Objectives, Losses, and Supervision
Learning modality tokens may involve explicit supervision, alignment losses, auxiliary objectives, and dynamic optimization procedures.
- In BREEN (Li et al., 16 Mar 2025), an alignment loss encourages the learnable queries to approximate CLIP's visual semantics at both fine and coarse levels (a minimal sketch of such a loss follows this list).
- DeepMLF (Georgiou et al., 15 Apr 2025) combines the sentiment-analysis task loss (language-model negative log-likelihood), auxiliary modality-specific losses, and language-modeling regularization, with controlled gating mechanisms to modulate fusion.
- Directed tokens (Truong et al., 19 Aug 2025) are trained with supervised objectives to reconstruct shuffled image/text orders, enforcing robust cross-modal relationships.
- ControlMLLM (Wu et al., 31 Jul 2024) introduces test-time optimization of a latent variable added to the visual tokens, guided by an energy function and updated by gradient descent (sketched at the end of this section).
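As an illustration of the first bullet, here is a minimal sketch of a fine-plus-coarse alignment loss that distills learnable query tokens toward CLIP visual features. The class name, the linear projection, and the cosine-similarity form are assumptions, not BREEN's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAlignmentLoss(nn.Module):
    """Sketch: pull learnable queries toward CLIP visual features at two granularities."""
    def __init__(self, query_dim: int, clip_dim: int):
        super().__init__()
        self.proj = nn.Linear(query_dim, clip_dim)  # map queries into CLIP feature space

    def forward(self, queries, clip_patch_feats, clip_global_feat):
        # queries:          (B, Q, D_q) learnable query tokens from the model
        # clip_patch_feats: (B, Q, D_c) CLIP patch features assigned to each query (assumed pairing)
        # clip_global_feat: (B, D_c)    pooled CLIP image embedding
        q = self.proj(queries)
        fine = 1.0 - F.cosine_similarity(q, clip_patch_feats, dim=-1).mean()
        coarse = 1.0 - F.cosine_similarity(q.mean(dim=1), clip_global_feat, dim=-1).mean()
        return fine + coarse
```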
These mechanisms enrich the modalities’ representational capacity and facilitate fine-grained, task-centric alignment.
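The test-time optimization idea can be sketched as follows, assuming a frozen model whose forward pass accepts visual and text tokens and an externally supplied energy function (e.g., one derived from attention over a referred region); the function signature and optimizer choice are illustrative assumptions.

```python
import torch

def optimize_visual_latent(model, visual_tokens, text_tokens, energy_fn,
                           steps: int = 10, lr: float = 0.1):
    """Sketch: update a latent delta added to the visual tokens by gradient
    descent on an energy function, keeping the model weights fixed."""
    delta = torch.zeros_like(visual_tokens, requires_grad=True)
    optimizer = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        outputs = model(visual_tokens + delta, text_tokens)  # assumed forward signature
        energy = energy_fn(outputs)                          # lower energy = better grounding
        energy.backward()
        optimizer.step()
    return (visual_tokens + delta).detach()
```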
4. Applications in Multimodal Learning
Learnable modality tokens have demonstrated efficacy across a broad range of multimodal applications:
- Radiology report generation (METransformer (Wang et al., 2023)): Expert tokens specialize in distinct image regions, with orthogonality and voting strategies yielding higher CIDEr and BLEU scores in complex diagnostic scenarios.
- Multimodal conditional image synthesis (MMoT (Zheng et al., 2023)): Modality tokens enable adaptive fusion for text, sketch, layout, and segmentation with superior FID and Inception Scores.
- Autonomous driving (Prompting Multi-Modal Tokens (Duan et al., 7 Apr 2024)): Multi-modality tokens encode fused LiDAR and visual inputs for integrated planning and control, outperforming pure-language and perception-pipeline alternatives in CARLA evaluations.
- Object re-identification (Magic Tokens (Zhang et al., 15 Mar 2024)): Dynamic selection via spatial-frequency modules yields more discriminative representations for RGB, NIR, and TIR, boosting mAP and Rank-1 metrics.
- Encoder-free multimodal models (BREEN (Li et al., 16 Mar 2025)): Learnable queries, distilled from CLIP, reduce training data requirements while matching or improving upon state-of-the-art benchmarks such as MMMU and MMStar.
A plausible implication is that learnable modality tokens can deliver state-of-the-art performance even with significantly reduced training data and parameter counts, provided strong supervision and careful architectural design.
5. Robustness, Efficiency, and Handling Missing Modalities
Recent advances focus on robustness to missing or imbalanced modalities via improved dropout and fusion strategies.
- Modality dropout with learnable tokens (Gu et al., 22 Sep 2025) replaces zero matrices with trainable image and tabular "missing" tokens, allowing the fusion module to adaptively represent missingness (see the sketch after this list). This leads to improved AUROC and probability estimation in disease detection, even when only a single modality is present.
- Contrastive losses that align unimodal and fused multimodal representations further mitigate modality imbalance and enhance generalization, which is essential for clinical and real-world settings.
- In LeMeViT (Jiang et al., 16 May 2024), learnable meta tokens produce sparse, information-rich summaries, achieving a speedup in remote sensing tasks while maintaining competitive accuracy.
- In continuous speech modeling (Yuan et al., 6 Dec 2024), replacing discrete tokens with flow-matched continuous tokens avoids quantization loss, resulting in lower WER and improved multi-modality learning robustness.
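A minimal sketch of the missing-modality idea follows, assuming an image-plus-tabular setup: when a modality is absent, its embedding is replaced by a learnable token block rather than zeros. Module and argument names are assumptions.

```python
import torch
import torch.nn as nn

class MissingModalityTokens(nn.Module):
    """Sketch: substitute learnable tokens for absent modalities before fusion."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.missing_image = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.missing_tabular = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, image_embeds=None, tabular_embeds=None, batch_size: int = 1):
        # Each input is (B, N, D) if present, or None if the modality is missing
        # (dropped during training or unavailable at inference).
        if image_embeds is None:
            image_embeds = self.missing_image.unsqueeze(0).expand(batch_size, -1, -1)
        if tabular_embeds is None:
            tabular_embeds = self.missing_tabular.unsqueeze(0).expand(batch_size, -1, -1)
        # Concatenated sequence passed on to the fusion module.
        return torch.cat([image_embeds, tabular_embeds], dim=1)
```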
The trend toward trainable modality tokens suggests highly scalable, resource-efficient, and robust multimodal architectures suitable for deployment in incomplete or challenging information environments.
6. Cross-Modal Alignment and Reasoning
Modality tokens also serve as mediators for cross-modal alignment, improving reasoning by enforcing structural and semantic coherence.
- Directed token placement and order reconstruction tasks (Truong et al., 19 Aug 2025) enhance the model’s ability to use both visual and textual context for reconstructing shuffled image/text order, measured via permutation accuracy and attention scores.
- ImageBind-LLM (Han et al., 2023) employs a learnable bind network with zero-initialized gating to inject visual, audio, and 3D point cloud features into LLM layers, leveraging a unified embedding space for instruction following across modalities (a minimal gating sketch follows this list).
- Multimodal prompt tuning with optimal transport (Wang et al., 2023) employs hierarchical transportation problems to align multi-mode token sets, achieving improved few-shot accuracy and generalization.
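The zero-initialized gating pattern can be sketched as follows: multimodal features are added to the language model's hidden states through a gate that starts at zero, so training begins from the unmodified LLM and gradually admits the new modality signal. The class name and projection are assumptions, not ImageBind-LLM's exact bind network.

```python
import torch
import torch.nn as nn

class ZeroInitGatedInjection(nn.Module):
    """Sketch: inject projected modality features via a zero-initialized gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)           # bind-network-style projection (assumed)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts at zero: no initial perturbation

    def forward(self, hidden_states: torch.Tensor, modality_feats: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, N, D) LLM activations; modality_feats: (B, D) or (B, N, D)
        injected = self.proj(modality_feats)
        if injected.dim() == 2:
            injected = injected.unsqueeze(1)      # broadcast a global feature over the sequence
        return hidden_states + torch.tanh(self.gate) * injected
```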
A plausible implication is that explicit structural supervision (e.g., reconstruction tasks, contrastive alignment) in conjunction with modality tokens is critical for building models with strong cross-modal reasoning and reduced hallucination.
7. Future Directions and Scalability Considerations
Research in learnable modality tokens points toward several future directions:
- Exploring deeper fusion (e.g., via multiple MM blocks as in DeepMLF (Georgiou et al., 15 Apr 2025)) and dedicated multimodal token capacity as a driver of improved performance.
- Leveraging knowledge distillation from strong pretrained vision models (e.g., CLIP supervision in BREEN (Li et al., 16 Mar 2025)) to enable parameter- and data-efficient multimodal learning.
- Scaling to larger model and data regimes and extending token design to non-discrete, continuous representations (Flow-Omni (Yuan et al., 6 Dec 2024)) or recursive discrete semantics (DDT tokens (Pan et al., 20 Apr 2025)).
- Developing robust mechanisms (e.g., modality dropout with adaptive tokens (Gu et al., 22 Sep 2025)) for practical use in heterogeneous and missing-data environments.
These developments indicate that learnable modality tokens are foundational for controllable, interpretable, and scalable multimodal learning—serving operational, biomedical, and creative AI systems where precise cross-modal fusion, efficiency, and adaptability are paramount.