MALM: Modular Multi-Information Adapters
- MALM is a class of modular, parameter-efficient neural adapters that extend pre-trained models for efficient transfer learning, multitask learning, and domain adaptation.
- They incorporate lightweight, learnable units integrated via merging strategies such as summation, concatenation, and graph fusion to handle multi-source information.
- Empirical evaluations across vision, language modeling, and neural translation reveal near single-task performance and significant computational savings despite some interference trade-offs.
Multi-Information Adapters (MALM) are a class of modular, parameter-efficient neural architecture components used to facilitate transfer, multitask learning, domain adaptation, and knowledge integration in both vision and language settings. These adapters extend pre-trained transformers or similar models by introducing lightweight, learnable units that specialize for given tasks, sources of information, or domains, and can be flexibly combined, stacked, or merged. Cross-disciplinary instantiations—ranging from multi-LoRA merges in vision (Kesim et al., 2024) through multi-graph-attentional mitigation of LLM hallucinations (Jia et al., 14 Jun 2025) to compositional multilingual/domain adaptation in neural translation (Stickland et al., 2021)—demonstrate MALM's adaptability and the unifying principle of multi-information injection.
1. Foundational Concepts and Architectures
MALM generalizes the adapter principle: selectively parameterized, low-dimensional or bottleneck modules are inserted into pre-trained backbone architectures, enabling efficient adaptation to tasks or domains with minimal full-model retraining. Common design patterns include:
- Vision Transformer Adapters: LoRA-style modules are injected into the key and value projection matrices of a frozen Vision Transformer (ViT) base. The low-rank update is parameterized as $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ (Kesim et al., 2024).
- LLM Hallucination Adapter: For autoregressive LLMs, MALM is constructed as a plug-in between the final transformer block and the output head. It processes (a) the input query, (b) the context (partial outputs), and (c) external knowledge as a multi-graph input to a graph attention network (GAT), and outputs reweighted logits that are fused with the LLM's output (Jia et al., 14 Jun 2025).
- NMT Multilingual/Domain Adapters: Adapters are stacked per transformer sublayer, with separate parameters for language (LA) and domain (DA). Each consists of bottleneck projections and norm layers, and can be composed in encoder, decoder, or both, yielding fine-grained control over adaptation (Stickland et al., 2021).
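To make the low-rank adapter pattern concrete, the following is a minimal sketch of a LoRA-style update on a single frozen weight matrix. The dimensions `d`, `k`, `r`, the matrix values, and the helper names `matmul` and `add` are illustrative assumptions, not values or code from the cited papers.

```python
# Minimal LoRA-style update in pure Python (no external libraries).
# All shapes and values are illustrative assumptions.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, k, r = 4, 4, 1  # output dim, input dim, adapter rank (rank is far smaller in practice)

# Frozen backbone weight (identity here, purely for readability).
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]
B = [[0.5], [0.0], [0.0], [0.0]]  # d x r, trainable
A = [[0.0, 2.0, 0.0, 0.0]]        # r x k, trainable

delta_W = matmul(B, A)            # d x k low-rank update
W_adapted = add(W, delta_W)       # effective weight at inference; W itself is unchanged

full_params = d * k               # parameters touched by full fine-tuning of W
lora_params = r * (d + k)         # parameters trained by the adapter

print(W_adapted[0])               # [1.0, 1.0, 0.0, 0.0]
print(full_params, lora_params)   # 16 8
```

At these toy sizes the saving is small; the adapter's parameter count grows as $r(d + k)$ rather than $dk$, so the gap widens as the projection dimensions grow relative to the rank.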
2. Adapter Merging, Composition, and Interaction
A core attribute of MALM is the ability to combine information from multiple adapters. This can occur via:
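- Linear (Summation) Merging: For LoRA adapters, the merged update is a weighted sum of the per-adapter low-rank updates, $\Delta W_{\text{merged}} = \sum_{i} \alpha_i \, \Delta W_i$; in practice, uniform weights are used (Kesim et al., 2024).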
- Concatenation Merging: Adapters' low-rank matrices are concatenated into a single pair of merged matrices, and the resulting merged update expands to a sum of all self and cross terms (Kesim et al., 2024). Empirically, cross-terms do not catastrophically degrade multitask performance.
- Multi-Graph Fusion for LLMs: Input, context, and knowledge are separately encoded, with their token representations interacting through a heterogeneous GAT as nodes in a multi-type edge graph, allowing the adapter to model complex dependencies between information sources (Jia et al., 14 Jun 2025).
- Stacking for NMT: Language and domain adapters can be sequentially stacked within each transformer layer, or restricted to encoder/decoder sides to control interference and transfer (Stickland et al., 2021).
Adapter merging strategies determine the degree of parameter and inference sharing, the extent of information cross-contamination, and computational overhead.
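The difference between summing per-adapter updates and combining the factors before multiplying can be checked numerically. The sketch below uses two hypothetical rank-1 adapters and uniform weights. It illustrates one way cross terms can arise (combining factors first), which is an assumption made for illustration and not necessarily the exact construction used by Kesim et al.

```python
# Two ways of merging LoRA-style adapters, in pure Python.
# Adapter values are hypothetical and chosen so the arithmetic is easy to follow.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

B1, A1 = [[1.0], [0.0]], [[1.0, 0.0]]  # adapter 1 (rank 1)
B2, A2 = [[0.0], [1.0]], [[0.0, 2.0]]  # adapter 2 (rank 1)

# (1) Sum the per-adapter updates with uniform weights: only self terms B_i A_i.
merged_updates = add(matmul(B1, A1), matmul(B2, A2))

# (2) Combine the factors first, then multiply: (B1 + B2)(A1 + A2).
merged_factors = matmul(add(B1, B2), add(A1, A2))

# The gap between the two is exactly the cross terms B1 A2 + B2 A1.
cross_terms = add(matmul(B1, A2), matmul(B2, A1))

print(merged_updates)   # [[1.0, 0.0], [0.0, 2.0]]
print(merged_factors)   # [[1.0, 2.0], [1.0, 2.0]]
print(merged_factors == add(merged_updates, cross_terms))  # True
```

The cross terms mix one adapter's output directions with another adapter's input directions, which is where interference between merged tasks can enter.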
3. Training, Inference, and Objective Functions
MALM relies on modular training with the following regime:
- Single-Adapter Training: Each LoRA or adapter module is trained independently on its designated task or domain, leaving the backbone frozen. The ViT-based vision adapters use cross-entropy for classification or a regression loss for regression tasks (Kesim et al., 2024), while LLM-based adapters use standard next-token cross-entropy (Jia et al., 14 Jun 2025). For NMT, adapters are trained with negative log-likelihood on (possibly synthetic) parallel text (Stickland et al., 2021).
- Merging and Inference: After training, adapters can be merged without retraining. For $N$-task inference, this greatly speeds up evaluation: a merged LoRA for $N$ tasks requires a single backbone pass (one merged update per projection) instead of $N$ sequential passes (Kesim et al., 2024).
- Zero-shot and Domain Adaptation: In NMT, training only DAs on a new domain in a subset of languages and leveraging back-translation for unobserved pairs enables robust cross-lingual transfer (Stickland et al., 2021).
- No Additional Losses: LLM MALM does not require hallucination-specific or auxiliary losses; the adapter’s architectural graph constraints suffice to reduce hallucination (Jia et al., 14 Jun 2025).
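The frozen-backbone regime can be illustrated with a deliberately tiny example: a scalar "backbone" weight that is never updated, plus two scalar adapter factors trained by gradient descent. The squared-error loss, learning rate, and all values are assumptions chosen for the sketch. The cited works train full adapter modules with the task losses listed above.

```python
# Toy single-adapter training: the backbone weight w stays frozen,
# only the adapter factors a and b receive gradient updates.

w = 1.0              # frozen backbone weight (never modified below)
a, b = 0.5, 0.0      # adapter factors; b starts at zero so the initial update b * a is zero
x, target = 1.0, 3.0 # a single hypothetical training example
lr = 0.05            # learning rate (assumed)

for _ in range(500):
    pred = (w + b * a) * x        # forward pass with the adapted weight
    err = pred - target           # derivative of squared error is 2 * err * d(pred)
    grad_a = 2.0 * err * b * x    # gradient of the loss with respect to a
    grad_b = 2.0 * err * a * x    # gradient of the loss with respect to b
    a -= lr * grad_a
    b -= lr * grad_b

final_pred = (w + b * a) * x
print(w, round(final_pred, 3))
```

Because `w` is untouched, the trained pair `(a, b)` can be stored, swapped, or merged with other adapters without altering the shared backbone.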
4. Empirical Results, Performance, and Trade-offs
MALM’s performance—quantified over multi-task, faithfulness, and adaptation metrics—depends on task similarity, adapter architecture, and merging strategy. Direct findings include:
- Vision Adapters: Merged LoRA adapters for tasks with dissimilar data, such as FireRisk and Galaxy10, retain F1 within $1$ point of single-task performance (e.g., LoRA-64, F1 75.9 vs. 76.7), but similar-domain merges (e.g., UTKFace classification and regression) cause substantial degradation (regression RMSE rises from $0.868$ to $1.268$) (Kesim et al., 2024).
- LLM Hallucination Mitigation: MALM-adapted LLMs achieve +2–130% relative improvement in ROUGE-2 and set SOTA on faithfulness (FEQA) across HaluEval, TruthfulQA, NQ, and TriviaQA. Preference rates are also high: GPT-4 judges favor MALM outputs in 79.4% of comparisons and human evaluators in 65.6% (Jia et al., 14 Jun 2025).
- NMT Domain Adaptation: In full-resource multilingual adaptation, language+domain adapters achieve BLEU within a few points of domain-tagged full-finetuning baselines (e.g., LA+DA combo: BLEU 42.7 vs. 46.0, Table 3), while in partial-resource settings, encoder- or decoder-only DAs with back-translation give gains of $3$–$4$ BLEU (e.g., 36.9 achieved for unseen directions) (Stickland et al., 2021).
- Computational Cost: Merging adapters reduces inference time linearly in the number of tasks: a merged model needs a single forward pass for $N$ tasks, compared with $N$ passes in unmerged deployment (Kesim et al., 2024).
Below is a table showcasing the adapter domains and key empirical outcomes:
| Setting | Adapter Strategy | Main Metrics | Key Results |
|---|---|---|---|
| Vision (ViT+LoRA) | Concatenation Merge | Accuracy, F1, RMSE, NME | Near-single-task for dissimilar; degradation for similar tasks (Kesim et al., 2024) |
| LLM Hallucination | Graph-attention MALM | ROUGE, FEQA, Human Eval | SOTA on faithfulness, large non-hallucination rate gains (Jia et al., 14 Jun 2025) |
| NMT Multilingual/Domain | LA + (encoder/decoder) DA; DADropout; BT | BLEU, off-target rate | +3–4 BLEU on unseen directions; off-target rate reduced from 23% to 6% (Stickland et al., 2021) |
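The size of the reported gaps is easier to compare once the figures quoted above are turned into absolute and relative differences. The short calculation below uses only numbers stated in this section. The variable names are illustrative.

```python
# Gaps implied by the figures quoted in this section.

f1_gap = round(abs(76.7 - 75.9), 1)                       # FireRisk/Galaxy10 merge, LoRA-64
rmse_before, rmse_after = 0.868, 1.268                    # UTKFace regression RMSE
rmse_rel_increase = round(100 * (rmse_after - rmse_before) / rmse_before, 1)
bleu_gap = round(46.0 - 42.7, 1)                          # full fine-tuning vs. LA+DA adapters
off_target_rel_drop = round(100 * (23.0 - 6.0) / 23.0, 1) # off-target rate, 23% down to 6%

print(f1_gap)               # 0.8 F1 points
print(rmse_rel_increase)    # 46.1 (% relative RMSE increase)
print(bleu_gap)             # 3.3 BLEU points
print(off_target_rel_drop)  # 73.9 (% relative reduction)
```

On these numbers, the dissimilar-task vision merge loses under one F1 point, while the similar-task merge raises RMSE by roughly 46% in relative terms.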
5. Limitations and Diagnostic Findings
MALM demonstrates strong adaptability but is subject to several constraints:
- Performance Drop Upon Merging: Across vision and translation, merging adapters for similar domains induces larger degradation because overlapping information interferes (via cross-terms in the merged update) (Kesim et al., 2024, Stickland et al., 2021).
- Catastrophic Forgetting: In NMT, naïve stacking of language and domain adapters for unseen language/domain combinations causes BLEU to drop below $20$, with high off-target rate (Stickland et al., 2021).
- No Dynamic Weighting: Current merging schemes use uniform weighting; adaptive or learned merging coefficients are left to future work (Kesim et al., 2024).
- Scope of Hallucination Mitigation: LLM-based MALM reduces input-, context-, and fact-conflicting hallucinations, but fine-grained numerical reasoning and multi-hop inference remain unresolved (Jia et al., 14 Jun 2025).
- Access Restrictions: LLM MALM requires access to model hidden states, currently limiting it to open-source models (Jia et al., 14 Jun 2025).
6. Extension Directions and Ongoing Work
Proposed improvements and research avenues for MALM include:
- Weighted and Orthogonal Merging: Introducing learnable merging weights per adapter or enforcing orthogonality between the spans of the adapters' low-rank matrices may further reduce interference (Kesim et al., 2024).
- Dynamic Gating and Routing: At inference, activating only a subset of relevant adapters or information sources could increase efficiency and reduce interference from irrelevant domains (Kesim et al., 2024).
- Auxiliary Losses and Reasoners: For LLMs, integrating hallucination-aware losses or explicit chain-of-thought modules into the graph can provide additional mitigation signals (Jia et al., 14 Jun 2025).
- Continual Update and Modular Expansion: MALM facilitates continual learning by enabling addition (or removal) of adapters without retraining the backbone (Kesim et al., 2024).
- Generalization across Modalities: Extending MALM principles to generative vision models (e.g., diffusion) or to other PEFTs (e.g., DoRA, O-LoRA, QLoRA) is a plausible frontier (Kesim et al., 2024).
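As a rough illustration of the weighted-merging and gating directions listed above, the sketch below normalises per-adapter relevance scores with a softmax, keeps only the top-`k` adapters, and forms a weighted sum of their updates. The scores, the function names `softmax`, `top_k_gate`, and `weighted_merge`, and the gating rule itself are hypothetical. None of this is an implementation from the cited papers, which leave these mechanisms to future work.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(scores, k):
    """Keep the k highest-scoring adapters and renormalise their weights; others get 0."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    gated = [0.0] * len(scores)
    for i, w in zip(keep, weights):
        gated[i] = w
    return gated

def weighted_merge(updates, weights):
    """Weighted element-wise sum of equally shaped update matrices."""
    rows, cols = len(updates[0]), len(updates[0][0])
    return [[sum(w * U[i][j] for w, U in zip(weights, updates)) for j in range(cols)]
            for i in range(rows)]

# Three hypothetical adapter updates; the third is scored as irrelevant to the input.
updates = [[[2.0, 0.0]], [[0.0, 4.0]], [[100.0, 100.0]]]
scores = [2.0, 2.0, -5.0]

weights = top_k_gate(scores, k=2)
merged = weighted_merge(updates, weights)
print(weights)  # [0.5, 0.5, 0.0]
print(merged)   # [[1.0, 2.0]]
```

In a learned variant the scores would be produced by a trainable router or stored as per-adapter parameters. Here they are fixed constants so the behaviour is easy to verify by hand.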
7. Significance and Synthesis
MALM consolidates multiple lines of adapter research under a unified paradigm of multi-information modularity. Empirical validations across disparate domains (vision, language modeling, translation) confirm that MALM-style adapters provide substantial practical benefit: parameter-efficient multitask deployment, domain transfer that avoids catastrophic forgetting when adapters are composed carefully, and enhanced factual faithfulness in LLMs. Balancing modular composition, computational efficiency, and interference minimization remains an active research area, with learned weighting and dynamic selection promising further gains.