MALM: Modular Multi-Information Adapters

Updated 17 March 2026

MALM is a class of modular, parameter-efficient neural adapters that extend pre-trained models for efficient transfer learning, multitask learning, and domain adaptation.
They incorporate lightweight, learnable units integrated via merging strategies such as summation, concatenation, and graph fusion to handle multi-source information.
Empirical evaluations across vision, language modeling, and neural machine translation show near-single-task performance and substantial computational savings, with some interference trade-offs.

Multi-Information Adapters (MALM) are a class of modular, parameter-efficient neural architecture components used to facilitate transfer, multitask learning, domain adaptation, and knowledge integration in both vision and language settings. These adapters extend pre-trained transformers or similar models by introducing lightweight, learnable units that specialize in given tasks, sources of information, or domains, and can be flexibly combined, stacked, or merged. Instantiations across fields, ranging from multi-LoRA merges in vision (Kesim et al., 2024) through multi-graph attention for mitigating LLM hallucinations (Jia et al., 14 Jun 2025) to compositional multilingual/domain adaptation in neural machine translation (Stickland et al., 2021), demonstrate MALM's adaptability and the unifying principle of multi-information injection.

1. Foundational Concepts and Architectures

MALM generalizes the adapter principle: selectively parameterized, low-dimensional or bottleneck modules are inserted into pre-trained backbone architectures, enabling efficient adaptation to tasks or domains with minimal full-model retraining. Common design patterns include:

Vision Transformer Adapters: LoRA-style modules are injected into the key and value projection matrices of a frozen Vision Transformer (ViT) base. The low-rank update is parameterized as $W = W_0 + AB^\top$ with $A,B \in \mathbb{R}^{d\times r}$ , $r \ll d$ (Kesim et al., 2024).
LLM Hallucination Adapter: For autoregressive LLMs, MALM is constructed as a plug-in between the final transformer block and output head. It processes (a) input queries, (b) context (partial outputs), (c) external knowledge, via multi-graph input to a graph attention network (GAT), outputting reweighted logits fused with the LLM (Jia et al., 14 Jun 2025).
NMT Multilingual/Domain Adapters: Adapters are stacked per transformer sublayer, with separate parameters for language (LA) and domain (DA). Each consists of bottleneck projections and norm layers, and can be composed in encoder, decoder, or both, yielding fine-grained control over adaptation (Stickland et al., 2021).
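The low-rank update $W = W_0 + AB^\top$ can be illustrated with a minimal NumPy sketch. The sizes, the random stand-in for the frozen projection, and the zero initialization of one factor (a common LoRA convention, not something stated in the cited work) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and rank, illustrative values with r << d

W0 = rng.normal(size=(d, d))        # stand-in for a frozen pre-trained projection
A = rng.normal(size=(d, r)) * 0.01  # trainable factor
B = np.zeros((d, r))                # trainable factor, zero init so the adapter starts as a no-op

def lora_weight(W0, A, B):
    """Effective weight W = W0 + A B^T."""
    return W0 + A @ B.T

x = rng.normal(size=(2, d))         # a batch of two token embeddings
y_base = x @ W0
y_lora = x @ lora_weight(W0, A, B)  # identical to y_base while B is zero

full_params = W0.size               # d * d parameters in the frozen projection
lora_params = A.size + B.size       # 2 * d * r trainable parameters
print(full_params, lora_params)
```

With these illustrative sizes the adapter trains $2dr$ parameters per projection against $d^2$ in the frozen matrix, which is where the parameter efficiency comes from.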

2. Adapter Merging, Composition, and Interaction

A core attribute of MALM is the ability to combine information from multiple adapters. This can occur via:

Linear (Summation) Merging: For $N$ LoRA adapters, $W_{\rm merged} = W_0 + \sum_{i=1}^N \alpha_iA_iB_i^\top$ ; in practice, uniform weights $\alpha_i=1$ are used (Kesim et al., 2024).
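Concatenation Merging: Adapters' low-rank matrices are concatenated along the rank dimension, producing $A_{\rm cat} = [A_1\;\cdots\;A_N]$ and $B_{\rm cat} = [B_1\;\cdots\;B_N]$, both in $\mathbb{R}^{d\times Nr}$, so $W_{\rm merged} = W_0 + A_{\rm cat}B_{\rm cat}^\top$ (Kesim et al., 2024). By block multiplication, $A_{\rm cat}B_{\rm cat}^\top = \sum_{i=1}^N A_iB_i^\top$, so this form coincides with uniform-weight summation; cross terms $A_iB_j^\top$ with $i \neq j$ arise only if the factors are summed before multiplying, as in $(\sum_i A_i)(\sum_j B_j)^\top$. Empirically, interference between merged adapters does not catastrophically degrade multitask performance.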
Multi-Graph Fusion for LLMs: Input, context, and knowledge are separately encoded, with their token representations interacting through a heterogeneous GAT as nodes in a multi-type edge graph, allowing the adapter to model complex dependencies between information sources (Jia et al., 14 Jun 2025).
Stacking for NMT: Language and domain adapters can be sequentially stacked within each transformer layer, or restricted to encoder/decoder sides to control interference and transfer (Stickland et al., 2021).
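The two LoRA merging formulas can be compared numerically. The sketch below uses random stand-in matrices and illustrative sizes; the helper names are hypothetical and not taken from the cited work. Under the column-wise definition of $A_{\rm cat}$ and $B_{\rm cat}$, block multiplication makes the concatenated product equal to the uniform-weight sum, whereas summing the factors before multiplying (a third variant included only for contrast) introduces cross terms $A_iB_j^\top$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, N = 32, 4, 3                  # illustrative sizes

W0 = rng.normal(size=(d, d))
As = [rng.normal(size=(d, r)) for _ in range(N)]
Bs = [rng.normal(size=(d, r)) for _ in range(N)]

def merge_sum(W0, As, Bs, alphas=None):
    """W0 + sum_i alpha_i A_i B_i^T, with uniform alpha_i = 1 by default."""
    alphas = [1.0] * len(As) if alphas is None else alphas
    return W0 + sum(a * (A @ B.T) for a, A, B in zip(alphas, As, Bs))

def merge_concat(W0, As, Bs):
    """W0 + A_cat B_cat^T, with factors concatenated along the rank axis."""
    A_cat = np.concatenate(As, axis=1)   # shape (d, N*r)
    B_cat = np.concatenate(Bs, axis=1)   # shape (d, N*r)
    return W0 + A_cat @ B_cat.T

def merge_factor_sum(W0, As, Bs):
    """W0 + (sum_i A_i)(sum_j B_j)^T: summing factors first adds cross terms A_i B_j^T."""
    return W0 + sum(As) @ sum(Bs).T

W_sum = merge_sum(W0, As, Bs)
W_cat = merge_concat(W0, As, Bs)
W_cross = merge_factor_sum(W0, As, Bs)

print(np.allclose(W_sum, W_cat))     # concatenation matches uniform summation
print(np.allclose(W_sum, W_cross))   # factor-wise summation does not
```

The practical difference between the first two forms is storage: the concatenated factors keep a rank-$Nr$ factorization, while the summed form is usually materialized as a dense $d \times d$ matrix.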

Adapter merging strategies determine the degree of parameter and inference sharing, the extent of information cross-contamination, and computational overhead.
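Stacking a language adapter and a domain adapter inside one sublayer can be sketched as follows. This is a minimal illustration rather than the cited architecture: the ReLU nonlinearity, the residual connection, the parameter-free layer norm, the zero initialization, and all class and function names are assumptions.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

class BottleneckAdapter:
    """Hypothetical bottleneck adapter: h + relu(LN(h) W_down) W_up."""
    def __init__(self, d, bottleneck, rng):
        self.W_down = rng.normal(size=(d, bottleneck)) * 0.02
        self.W_up = np.zeros((bottleneck, d))   # zero init: the adapter starts as the identity

    def __call__(self, h):
        z = np.maximum(layer_norm(h) @ self.W_down, 0.0)
        return h + z @ self.W_up

rng = np.random.default_rng(2)
d = 16
language_adapter = BottleneckAdapter(d, 4, rng)
domain_adapter = BottleneckAdapter(d, 4, rng)

def sublayer_output(h, use_domain=True):
    """Apply the language adapter, then optionally the domain adapter."""
    h = language_adapter(h)
    if use_domain:
        h = domain_adapter(h)
    return h
```

The `use_domain` flag mirrors the option of restricting domain adapters to the encoder or decoder side: a sublayer built with `use_domain=False` carries only the language adapter.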

3. Training, Inference, and Objective Functions

MALM relies on modular training with the following regime:

Single-Adapter Training: Each LoRA or adapter module is trained independently on its designated task or domain, leaving the backbone frozen. The ViT-based vision adapters use cross-entropy (classification) or $L_1$ loss (regression) (Kesim et al., 2024), while LLM-based adapters use standard next-token cross-entropy (Jia et al., 14 Jun 2025). For NMT, adapters are trained with negative log-likelihood on (possibly synthetic) parallel text (Stickland et al., 2021).
Merging and Inference: After training, adapters can be merged without retraining. At inference time this greatly speeds up multi-task evaluation: a merged LoRA for $N$ tasks requires a single backbone pass instead of $N$ sequential passes (Kesim et al., 2024).
Zero-shot and Domain Adaptation: In NMT, training only DAs on a new domain in a subset of languages and leveraging back-translation for unobserved pairs enables robust cross-lingual transfer (Stickland et al., 2021).
No Additional Losses: LLM MALM does not require hallucination-specific or auxiliary losses; the adapter’s architectural graph constraints suffice to reduce hallucination (Jia et al., 14 Jun 2025).
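Single-adapter training with a frozen backbone can be sketched with hand-written gradients. The example is a toy under stated assumptions: a squared-error loss replaces the cross-entropy and $L_1$ objectives named above for simplicity, the data are synthetic, and the sizes, learning rate, and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 64, 8, 2

W0 = rng.normal(size=(d, d)) / np.sqrt(d)   # frozen backbone weight, never updated
W0_before = W0.copy()
A = rng.normal(size=(d, r)) / np.sqrt(d)    # trainable factor
B = np.zeros((d, r))                        # trainable factor, zero init

# Synthetic regression target produced by a hidden rank-r shift of W0.
x = rng.normal(size=(n, d))
delta_true = 0.1 * rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
y = x @ (W0 + delta_true)

def loss_and_grads(A, B):
    """Mean squared-error loss and its gradients with respect to A and B only."""
    err = x @ (W0 + A @ B.T) - y            # shape (n, d)
    loss = 0.5 * np.sum(err ** 2) / n
    G = x.T @ err / n                       # gradient with respect to the effective weight
    return loss, G @ B, G.T @ A             # chain rule through W = W0 + A B^T

lr = 0.01
losses = []
for _ in range(300):
    loss, grad_A, grad_B = loss_and_grads(A, B)
    losses.append(loss)
    A -= lr * grad_A                        # only the adapter factors move
    B -= lr * grad_B

print(losses[0], losses[-1])
```

Because only `A` and `B` receive updates, the trained adapter can later be merged with others or removed without touching `W0`.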

4. Empirical Results, Performance, and Trade-offs

MALM’s performance—quantified over multi-task, faithfulness, and adaptation metrics—depends on task similarity, adapter architecture, and merging strategy. Direct findings include:

Vision Adapters: Merged LoRA adapters for tasks with dissimilar data, such as FireRisk and Galaxy10, retain F1 within about 1 point of single-task performance (e.g., LoRA-64: F1 75.9 vs. 76.7), but similar-domain merges (e.g., UTKFace classification and regression) cause substantial accuracy degradation, including a higher RMSE (Kesim et al., 2024).
LLM Hallucination Mitigation: MALM-adapted LLMs achieve +2–130% relative improvement in ROUGE-2 and state-of-the-art faithfulness (FEQA) across HaluEval, TruthfulQA, NQ, and TriviaQA. Preference evaluations also favor MALM outputs: 79.4% when judged by GPT-4 and 65.6% when judged by humans (Jia et al., 14 Jun 2025).
NMT Domain Adaptation: In full-resource multilingual adaptation, language+domain adapters achieve BLEU close to domain-tagged full-finetuning baselines (reported as within 1–2 points; e.g., the LA+DA combination reaches BLEU 42.7 vs. 46.0, Table 3), while in partial-resource settings, encoder- or decoder-only DAs with back-translation give gains of 3–4 BLEU (e.g., 36.9 achieved for unseen directions) (Stickland et al., 2021).
Computational Cost: Merging adapters reduces inference cost roughly linearly in the number of tasks, requiring a single forward pass for $N$ tasks rather than the $N$ passes of unmerged deployment (Kesim et al., 2024).

Below is a table showcasing the adapter domains and key empirical outcomes:

Setting	Adapter Strategy	Main Metrics	Key Results
Vision (ViT+LoRA)	Concatenation Merge	Accuracy, F1, RMSE, NME	Near-single-task for dissimilar; degradation for similar tasks (Kesim et al., 2024)
LLM Hallucination	Graph-attention MALM	ROUGE, FEQA, Human Eval	SOTA on faithfulness, large non-hallucination rate gains (Jia et al., 14 Jun 2025)
NMT Multilingual/Domain	LA + (encoder/decoder) DA; DADropout; BT	BLEU, off-target rate	+3–4 BLEU on unseen, off-target rate down from 23% → 6% (Stickland et al., 2021)
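The gaps implied by the figures quoted in this section can be checked with a few lines of arithmetic. The variable names are illustrative, and the sketch assumes the first number in each quoted pair belongs to the merged or adapter-based system and the second to the single-task or full fine-tuning baseline.

```python
# Figures as quoted in this section.
f1_merged, f1_single = 75.9, 76.7
bleu_adapters, bleu_finetune = 42.7, 46.0
off_target_before, off_target_after = 0.23, 0.06

f1_gap = round(f1_single - f1_merged, 1)             # absolute F1 gap in points
bleu_gap = round(bleu_finetune - bleu_adapters, 1)   # absolute BLEU gap in points
off_target_drop = (off_target_before - off_target_after) / off_target_before

print(f1_gap, bleu_gap, round(off_target_drop, 3))
```

The F1 example sits inside the "about 1 point" band, while the BLEU example shows a gap of just over 3 points.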

5. Limitations and Diagnostic Findings

MALM demonstrates strong adaptability but is subject to several constraints:

Performance Drop Upon Merging: Across vision and translation, merging adapters for similar domains induces larger degradation because overlapping information interferes (in the LoRA case, through overlapping updates in $W_{\rm merged}$) (Kesim et al., 2024, Stickland et al., 2021).
Catastrophic Forgetting: In NMT, naïve stacking of language and domain adapters for unseen language/domain combinations causes BLEU to collapse, accompanied by a high off-target rate (Stickland et al., 2021).
No Dynamic Weighting: Current merging schemes use uniform weighting; adaptive or learned merging coefficients are left to future work (Kesim et al., 2024).
Scope of Hallucination Mitigation: LLM-based MALM reduces input-, context-, and fact-conflicting hallucinations, but fine-grained numerical reasoning and multi-hop inference remain unresolved (Jia et al., 14 Jun 2025).
Access Restrictions: LLM MALM requires access to model hidden states, currently limiting it to open-source models (Jia et al., 14 Jun 2025).
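One way to make "overlapping information" concrete for LoRA adapters is to measure how aligned two updates are before merging them. The proxy below, the cosine similarity of $A_1B_1^\top$ and $A_2B_2^\top$ under the Frobenius inner product, is a hypothetical diagnostic and does not come from the cited papers. The "similar" and "dissimilar" adapters are synthetic stand-ins.

```python
import numpy as np

def update_overlap(A1, B1, A2, B2):
    """Cosine similarity between two LoRA updates under the Frobenius inner product."""
    dW1, dW2 = A1 @ B1.T, A2 @ B2.T
    return float(np.sum(dW1 * dW2) / (np.linalg.norm(dW1) * np.linalg.norm(dW2)))

rng = np.random.default_rng(4)
d, r = 32, 4
A1, B1 = rng.normal(size=(d, r)), rng.normal(size=(d, r))

# "Similar" adapter: a small perturbation of adapter 1 (synthetic stand-in).
A_sim = A1 + 0.1 * rng.normal(size=(d, r))
B_sim = B1 + 0.1 * rng.normal(size=(d, r))

# "Dissimilar" adapter: independently drawn factors.
A_dis, B_dis = rng.normal(size=(d, r)), rng.normal(size=(d, r))

overlap_similar = update_overlap(A1, B1, A_sim, B_sim)
overlap_dissimilar = update_overlap(A1, B1, A_dis, B_dis)
print(overlap_similar, overlap_dissimilar)
```

A value near 1 means the two updates push the backbone in nearly the same direction, so their sum changes the effective scale of both; a value near 0 means the updates are close to orthogonal. Whether this proxy predicts the degradation reported above is untested here.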

6. Extension Directions and Ongoing Work

Proposed improvements and research avenues for MALM include:

Weighted and Orthogonal Merging: Introducing learnable $\alpha_i$ per adapter or enforcing span orthogonality across the adapters' $A_i$ and $B_i$ matrices may further reduce interference (Kesim et al., 2024).
Dynamic Gating and Routing: At inference, activating only a subset of relevant adapters or information sources could increase efficiency and reduce interference from irrelevant domains (Kesim et al., 2024).
Auxiliary Losses and Reasoners: For LLMs, integrating hallucination-aware losses or explicit chain-of-thought modules into the graph can provide additional mitigation signals (Jia et al., 14 Jun 2025).
Continual Update and Modular Expansion: MALM facilitates continual learning by enabling addition (or removal) of adapters without retraining the backbone (Kesim et al., 2024).
Generalization across Modalities: Extending MALM principles to generative vision models (e.g., diffusion) or to other PEFTs (e.g., DoRA, O-LoRA, QLoRA) is a plausible frontier (Kesim et al., 2024).
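The first two directions can be sketched together: per-adapter coefficients $\alpha_i$, a penalty that is zero when the adapters' $A_i$ columns are mutually orthogonal, and a top-$k$ gate that switches adapters on or off. All function names, the binary gating rule, and the relevance scores are hypothetical; the cited work proposes these directions without prescribing an implementation.

```python
import numpy as np

def weighted_merge(W0, As, Bs, alphas):
    """W0 + sum_i alpha_i A_i B_i^T with per-adapter coefficients alpha_i."""
    return W0 + sum(a * (A @ B.T) for a, A, B in zip(alphas, As, Bs))

def orthogonality_penalty(As):
    """Sum over i < j of ||A_i^T A_j||_F^2; zero when columns of different adapters are orthogonal."""
    total = 0.0
    for i in range(len(As)):
        for j in range(i + 1, len(As)):
            total += float(np.sum((As[i].T @ As[j]) ** 2))
    return total

def top_k_gate(scores, k):
    """Binary coefficients keeping the k highest-scoring adapters (assumes 1 <= k <= len(scores))."""
    scores = np.asarray(scores, dtype=float)
    alphas = np.zeros_like(scores)
    alphas[np.argsort(scores)[-k:]] = 1.0
    return alphas

rng = np.random.default_rng(5)
d, r, N = 16, 2, 3
W0 = rng.normal(size=(d, d))

# Build A_i with mutually orthogonal columns by slicing one orthonormal basis.
Q, _ = np.linalg.qr(rng.normal(size=(d, N * r)))
As = [Q[:, i * r:(i + 1) * r] for i in range(N)]
Bs = [rng.normal(size=(d, r)) for _ in range(N)]

alphas = top_k_gate([0.1, 0.9, 0.5], k=2)   # hypothetical relevance scores per adapter
W_routed = weighted_merge(W0, As, Bs, alphas)
print(alphas, orthogonality_penalty(As))
```

In a trainable variant the penalty would be added to the task loss and the coefficients learned; here both are evaluated once on fixed matrices to show the quantities involved.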

7. Significance and Synthesis

MALM consolidates multiple lines of adapter research under a unified paradigm of multi-information modularity. Empirical results across disparate domains (vision, language modeling, translation) indicate that MALM-style adapters provide substantial practical benefit: parameter-efficient multitask deployment, domain transfer that limits catastrophic forgetting when adapters are composed with care, and enhanced factual faithfulness in LLMs. Balancing modular composition, computational efficiency, and interference minimization remains an active research area, and learned weighting and dynamic selection are expected to yield further gains.

References (3)

Kesim et al. (2024). Multi LoRA Meets Vision: Merging multiple adapters to create a multi task model.

Jia et al. (2025). MALM: A Multi-Information Adapter for Large Language Models to Mitigate Hallucination.

Stickland et al. (2021). Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters.
