Unified Sequence Modeling with Modality Adapters

Updated 12 November 2025
  • Unified sequence modeling with modality-specific adapters is a strategy that integrates lightweight, custom modules into standard transformers to efficiently process diverse inputs.
  • The approach decouples modality specialization from the core model, enabling rapid integration of new sensor types and domains with less than 5% additional parameters.
  • Empirical studies show state-of-the-art performance and sample efficiency across multimodal tasks, achieving competitive accuracy with significantly reduced computational cost.

Unified sequence modeling with modality-specific adapters refers to a family of architectural strategies, predominantly within the Transformer paradigm, that achieve unified processing of highly heterogeneous input modalities (text, vision, audio, structured signals, etc.) by integrating lightweight, modality-customized parameter modules—"adapters"—along standardized model pathways. This approach enables single models to efficiently and robustly process, align, and reason over arbitrarily composed inputs, with the flexibility to extend to unseen modalities or new tasks using minimal trainable parameters or data.

1. Foundational Principles and Motivations

Classical multimodal architectures relied on separate module branches—one per modality—followed by ad-hoc fusion or late concatenation, leading to parameter inefficiency and inflexible task adaptation. With the advent of large-scale foundation models, unified sequence modeling became desirable: a single backbone (often a frozen or lightly fine-tuned Transformer) should handle disparate modality inputs and exhibit seamless multi-task, multi-domain transfer.

The key technical innovation is the insertion of modality-specific adapters at strategic positions in the backbone. These adapters project, fuse, or reshape modality representations, often implementing parameter-efficient transfer at <5% of backbone size. The adapters can also promote cross-modal alignment or enable token-level fusion, as seen in multiple leading frameworks (Wander (Guo et al., 12 Dec 2024), UniAdapter (Lu et al., 2023), SEMI (İnce et al., 4 Sep 2025), VideoLLM (Chen et al., 2023), UASTrack (Wang et al., 25 Feb 2025), UniSOT (Ma et al., 3 Nov 2025)).

Adapters decouple modality specialization from the (often very large and frozen) model core, enabling rapid integration of new sensor types, low-resource domains, or “reference modalities” (e.g., bounding box vs. language queries) without catastrophic interference or full model duplication.

2. Architecture Designs: Adapter Placement and Formulation

Most unified architectures insert adapters at three locations (Lu et al., 2023; Guo et al., 12 Dec 2024; Ma et al., 3 Nov 2025; Wang et al., 25 Feb 2025); a schematic sketch follows the list:

  • Before/after cross-modal attention: To (re-)align representations between vision, language, etc.
  • Within encoder/decoder blocks: To regulate intra-modality updates and bottleneck dimension per modality.
  • At input/output fusion junctions: To mediate projection from raw encoder output into a model-agnostic feature space.
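
For concreteness, the following is a minimal PyTorch sketch of these three placements in a single encoder block. The class names, the toy bottleneck adapter, and the default sizes are illustrative assumptions rather than the architecture of any cited framework; the backbone sublayers are treated as frozen.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Toy residual bottleneck adapter (formulations are listed below)."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                                 nn.Linear(bottleneck, dim))

    def forward(self, x):
        return x + self.net(x)

class AdaptedEncoderBlock(nn.Module):
    """Frozen transformer block with adapters at the three placements above:
    the input/fusion junction, after (cross-)attention, and inside the block
    after the feed-forward sublayer."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        for p in self.parameters():          # backbone weights stay frozen
            p.requires_grad = False
        self.input_adapter = Adapter(dim)    # fusion-junction adapter (trainable)
        self.attn_adapter = Adapter(dim)     # adapter after (cross-)attention
        self.ffn_adapter = Adapter(dim)      # adapter inside the block

    def forward(self, x, context=None):
        x = self.input_adapter(x)
        h = self.norm1(x)
        kv = h if context is None else context   # self- or cross-modal attention
        attn_out, _ = self.attn(h, kv, kv)
        x = x + self.attn_adapter(attn_out)
        x = x + self.ffn_adapter(self.ffn(self.norm2(x)))
        return x

block = AdaptedEncoderBlock(dim=256, num_heads=4)
tokens = torch.randn(2, 10, 256)
print(block(tokens).shape)  # torch.Size([2, 10, 256])
```

The point of the sketch is that all modality-specific capacity lives in the small residual modules, while the attention and FFN weights stay shared and frozen; where each framework actually inserts its adapters varies.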

Adapters may have diverse forms:

  • Standard linear bottleneck adapters (Lu et al., 2023): Down-project via shared matrix, nonlinearity, up-project via modality-specific matrix:

$$\operatorname{Adapter}(\mathbf{x}) = \mathbf{x} + s\,\sigma(\mathbf{x} W_{\downarrow})\, W_{\uparrow}$$

  • Partial weight sharing: Share the down-projection among all modalities and specialize the up-projection; this reduces parameters while preserving knowledge transfer (Lu et al., 2023) (see the sketch after this list).
  • Low-rank decomposed adapters: Express the adapter update as $B A^\top$ with $A$, $B$ small matrices, decoupling capacity per modality or per attention/MLP layer (Guo et al., 12 Dec 2024, İnce et al., 4 Sep 2025, Ma et al., 3 Nov 2025).
  • Task-masked attention or routing: Insert masks or small MLPs (e.g., DAS in UASTrack) to dynamically route tokens or select the correct adapter based on input type (Wang et al., 25 Feb 2025, Ma et al., 3 Nov 2025).
  • Outer-product or CP-decomposed modules: Fuse multiple modality representations via efficient factorized outer products to allow true high-order multimodal interaction (Guo et al., 12 Dec 2024).
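
As a concrete instance of the bottleneck and partial weight sharing designs above, the following sketch, loosely in the spirit of UniAdapter, implements $\operatorname{Adapter}(\mathbf{x}) = \mathbf{x} + s\,\sigma(\mathbf{x} W_{\downarrow})\, W_{\uparrow}$ with a shared down-projection and one up-projection per modality. The class name, modality keys, and default sizes are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class PartiallySharedAdapter(nn.Module):
    """Adapter(x) = x + s * sigma(x @ W_down) @ W_up[modality].
    W_down is shared across modalities; W_up is modality-specific."""
    def __init__(self, dim, bottleneck=64,
                 modalities=("vision", "text", "fusion"), scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck, bias=False)   # shared W_down
        self.up = nn.ModuleDict({m: nn.Linear(bottleneck, dim, bias=False)
                                 for m in modalities})        # per-modality W_up
        self.act = nn.GELU()
        self.scale = scale

    def forward(self, x, modality):
        return x + self.scale * self.up[modality](self.act(self.down(x)))

# Usage: one adapter instance serves several streams of a frozen backbone.
adapter = PartiallySharedAdapter(dim=768)
vision_tokens = torch.randn(2, 197, 768)
text_tokens = torch.randn(2, 32, 768)
print(adapter(vision_tokens, "vision").shape)  # torch.Size([2, 197, 768])
print(adapter(text_tokens, "text").shape)      # torch.Size([2, 32, 768])
```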

A common motif is maintaining frozen backbones (e.g., ViT, BERT, decoder-only LLM) and only training or generating adapter parameters, with empirical evidence that adapters suffice to recover or exceed the representational power of full fine-tuning.

3. Multimodal Fusion Strategies in Unified Sequence Modeling

The fusion of modality-specific information is central to unified modeling. Representative approaches include:

  • Mixture-of-Experts (MoE): Separate parallel FFNs or adapter paths in the decoder, with a gating mechanism selecting the expert per modality or per token, as in GenDoc (Feng et al., 2023); a schematic gating sketch follows this list.
  • Cross-modal side adapters and expert fusion: For sequential tasks, side adapters per modality process backbone features, and an explicit Mixture-of-Modality-Expert-Fusion module integrates the outputs (inferred from Fu et al., 14 Apr 2025).
  • Outer-product fusion with low-rank CP decomposition: In Wander, token-level representations of each modality are projected into a shared latent, then fused by repeated outer-product + low-rank projections, avoiding exponential parameter growth (Guo et al., 12 Dec 2024).
  • Gated or attention-based fusion: Cross-attention may be used to select or merge reference and input modalities (e.g., image, text, box), with adapters mediating flows (Ma et al., 3 Nov 2025, Lu et al., 2023).
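
To make the MoE-style fusion concrete, here is a schematic per-token gating layer over modality-expert FFNs. GenDoc selects experts per modality; the soft gate shown here is a common generalization, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Schematic mixture-of-modality-experts feed-forward layer: each expert
    is a small FFN, and a learned gate mixes experts per token."""
    def __init__(self, dim, num_experts=3, hidden=2048):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                               # x: (batch, seq, dim)
        weights = self.gate(x).softmax(dim=-1)          # (batch, seq, num_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., dim, E)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)

tokens = torch.randn(2, 16, 512)
print(ModalityMoEFFN(512)(tokens).shape)  # torch.Size([2, 16, 512])
```

Replacing the softmax with a hard argmax keyed on the input modality recovers the per-modality expert selection described above.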

Table: Adapter Fusion Mechanisms Across Models

| Model | Fusion Mechanism | Adapter Type |
|---|---|---|
| Wander | CP low-rank outer product | Token/sequence-level, shared |
| UniAdapter | Shared-down/specific-up adapter | Blockwise bottleneck |
| GenDoc | MoE in decoder | Modality-specific FFN |
| UniSOT | Task-masked attention, LoRA | Reference- and modality-specific |
| SEMI | Projector + LoRA adapter | Hypernetwork-generated |

This diversity reflects the interplay between computational cost, modularity, and expressivity.

4. Adapter Generation, Training, and Transfer Protocols

The process of generating and training effective modality-specific adapters involves several specialized strategies:

  • Hypernetwork-based adapter synthesis: SEMI (İnce et al., 4 Sep 2025) uses a hypernetwork trained to emit modality-specific LoRA adapters given only few-shot paired examples and a textual instruction, enabling rapid extension to unseen modalities (a simplified sketch follows this list).
  • Unified multi-task pretraining: Adapters are optimized over all (present) modalities and tasks simultaneously using loss terms for each, often with balanced weights (Lu et al., 2023, Ma et al., 3 Nov 2025).
  • Contrastive, retrieval, and discriminative losses: Contrastive objectives align latent spaces; cross-entropy or NCE losses are standard for retrieval and VQA tasks (Lu et al., 2023).
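
The hypernetwork idea can be sketched as follows, under strong simplifying assumptions: a conditioning vector (standing in for an embedding of the few-shot examples and the textual instruction) is mapped to the low-rank factors of a LoRA update on a frozen linear layer. Names, dimensions, and the conditioning interface are illustrative and do not reproduce the SEMI architecture.

```python
import torch
import torch.nn as nn

class LoRAHypernetwork(nn.Module):
    """Simplified sketch: map a modality-description embedding to low-rank
    factors A and B that parameterize the update W + B @ A of a frozen layer."""
    def __init__(self, cond_dim, in_dim, out_dim, rank=8):
        super().__init__()
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
        self.to_A = nn.Linear(cond_dim, rank * in_dim)
        self.to_B = nn.Linear(cond_dim, out_dim * rank)

    def forward(self, cond):                      # cond: (cond_dim,) modality embedding
        A = self.to_A(cond).view(self.rank, self.in_dim)
        B = self.to_B(cond).view(self.out_dim, self.rank)
        return A, B

def lora_linear(x, weight, A, B, scale=1.0):
    """Frozen linear layer plus the generated low-rank update."""
    return x @ weight.T + scale * (x @ A.T) @ B.T

# Usage with illustrative sizes.
hyper = LoRAHypernetwork(cond_dim=256, in_dim=768, out_dim=768)
A, B = hyper(torch.randn(256))
frozen_W = torch.randn(768, 768)
tokens = torch.randn(4, 10, 768)
print(lora_linear(tokens, frozen_W, A, B).shape)  # torch.Size([4, 10, 768])
```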

Parameter sharing and bottleneck size are tuned to trade off expressivity and parameter savings. Sharing down-projection but not up-projection (UniAdapter (Lu et al., 2023)) is found to outperform other regimes, enabling per-modality specificity at minimal cost.
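
For scale, a back-of-the-envelope count under assumed sizes (hidden dimension 768, bottleneck 64, three modality streams; these numbers are illustrative, not taken from the papers) shows why such adapters stay well under the reported parameter budgets:

```python
# Illustrative parameter count for one partially shared bottleneck adapter.
d, b, M = 768, 64, 3           # hidden dim, bottleneck dim, number of modalities
shared_down = d * b            # W_down shared across modalities
specific_up = M * (b * d)      # one W_up per modality
print(shared_down + specific_up)  # 196608 parameters (~0.2M) per insertion point
```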

5. Empirical Effects: Efficiency, Generalization, and Task Performance

Across image-text, video-text, document, and sensor data domains, modality-specific adapters have shown state-of-the-art parameter efficiency and sample efficiency:

  • Wander (Guo et al., 12 Dec 2024) achieves SOTA or near-SOTA performance across 2- to 7-modality datasets (e.g., MSRVTT, CMU-MOSI, IEMOCAP) with 0.8–4.9M parameters (≪5% total).
  • UniAdapter (Lu et al., 2023) matches or exceeds full fine-tuning on MSRVTT retrieval (R@1 = 49.7% with 4.8M tunable parameters vs. 47.7% with 337M), and is competitive on VQA, VideoQA, and Flickr image-text retrieval.
  • GenDoc (Feng et al., 2023) leverages modality-specific experts to outperform encoder-only models (LayoutLMv3, DiT) in ANLS, mAP, and entity-F1 across DocVQA, PubLayNet, and CORD.
  • Sample-efficient integration (SEMI) (İnce et al., 4 Sep 2025) yields 16–64× reduction in required paired data for new modalities (e.g., satellite, astronomical, molecular) compared to training from scratch.

Adapters also support real-time operation for single object tracking (UASTrack, 44 FPS with ~2M params (Wang et al., 25 Feb 2025); UniSOT, 58 FPS (Ma et al., 3 Nov 2025)) while outperforming previous modality- or reference-specific pipelines.

6. Design Variants: Flexibility and Limitations

Variants and extensions of unified sequence modeling with adapters include:

  • Token-level vs. vector-level fusion: Token-wise adapters (Wander) enable fine-grained multimodal alignment versus only global (vector) fusion.
  • Dynamic vs. static adapter routing: Auto-selection modules (e.g., DAS in UASTrack, modality-masked attention in UniSOT) provide input-adaptive adapter selection, increasing flexibility (a routing sketch follows this list).
  • Adapter expressivity trade-off: Embedding adapters deeper in the model (e.g., at every Transformer layer) enhances cross-modal binding at the cost of higher memory, while shallow insertions optimize efficiency.
  • Support for unseen modalities: Hypernetwork and low-rank decomposed adapters allow extension to previously unseen or heterogeneous modalities (SEMI, Wander).
  • Limitations: Some frameworks support only one modality at a time during inference (İnce et al., 4 Sep 2025), while others are limited to pre-defined fusion architectures. Adapters that are too small or inserted too shallowly may underfit highly diverse semantics (VideoLLM (Chen et al., 2023)).
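
As a rough illustration of the input-adaptive routing referenced in the list above, the sketch below selects or mixes per-modality adapter branches with a lightweight classifier over pooled features. It is not the DAS module from UASTrack; the branch names, pooling choice, and selector are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdapterRouter(nn.Module):
    """Schematic input-adaptive routing: a lightweight classifier over pooled
    input features picks (or softly mixes) one adapter branch per sample."""
    def __init__(self, dim, branches=("rgb", "depth", "thermal"), bottleneck=64):
        super().__init__()
        self.adapters = nn.ModuleDict({
            b: nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                             nn.Linear(bottleneck, dim))
            for b in branches})
        self.branches = list(branches)
        self.selector = nn.Linear(dim, len(branches))

    def forward(self, x, hard=True):                   # x: (batch, seq, dim)
        logits = self.selector(x.mean(dim=1))          # pool over tokens
        if hard:                                       # pick one branch per sample
            idx = logits.argmax(dim=-1)
            return torch.stack([x[i] + self.adapters[self.branches[j]](x[i])
                                for i, j in enumerate(idx.tolist())])
        probs = logits.softmax(dim=-1)                 # soft-mixture variant
        branch_outs = torch.stack([x + self.adapters[b](x) for b in self.branches], dim=-1)
        return (branch_outs * probs.view(-1, 1, 1, len(self.branches))).sum(dim=-1)

x = torch.randn(2, 50, 256)
print(AdapterRouter(256)(x).shape)  # torch.Size([2, 50, 256])
```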

7. Impact, Extensions, and Future Outlook

Unified sequence modeling with modality-specific adapters is now a central paradigm in multimodal machine learning. It enables:

  • End-to-end, modular, and reusable inference across vision, language, structured signals (IMU), scientific modalities (molecules, astronomical data), and sensor fusion.
  • Sample- and parameter-efficient adaptation: Models such as Wander, UniAdapter, and SEMI consistently match or surpass full fine-tuning with only 1–5% additional parameters and 16–64× less data for new domains.
  • Seamless expansion: Adapters offer a plug-and-play mechanism for integrating new encoders, tasks, or modalities into unified, foundation architectures.
  • Research frontiers: Future work includes universal adapters for multiple modalities at once, hierarchical adapter synthesis, and integration into deeper layers (cross-attention), as well as further exploration of hard mixture gating, continual learning, and compositionality.

This approach fundamentally changes cross-modal modeling by shifting the computational and representational burden from monolithic retraining to modular, adapter-centric components, aligning with scalable foundation model design and real-world deployment demands.


Key references:

  • "A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter" (Guo et al., 12 Dec 2024)
  • "UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling" (Lu et al., 2023)
  • "Sample-efficient Integration of New Modalities into LLMs" (İnce et al., 4 Sep 2025)
  • "UniSOT: A Unified Framework for Multi-Modality Single Object Tracking" (Ma et al., 3 Nov 2025)
  • "UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking" (Wang et al., 25 Feb 2025)
  • "VideoLLM: Modeling Video Sequence with LLMs" (Chen et al., 2023)
  • "Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding" (Feng et al., 2023)