Fusion for Language Modeling: Strategies & Insights
- Fusion for language modeling is a set of strategies that merge outputs from multiple models across distribution, representation, and parameter spaces to boost performance.
- It employs methods like shallow, cold, and dynamic fusion to integrate probabilities and semantic embeddings, thereby enhancing controllability and fluency.
- Emerging techniques such as plug-and-play, evolutionary algorithms, and Bayesian optimization enable effective fusion without retraining while improving generalization.
Fusion for language modeling encompasses a broad spectrum of algorithmic strategies for integrating, merging, or composing the probabilistic, representational, and/or parametric outputs of multiple language models to enhance fluency, knowledge breadth, controllability, robustness, or task specialization. Methods range from shallow log-linear scoring and distillation-based knowledge fusion to parameter-space merging, evolutionary blending, and multimodal joint embedding schemes. Fusion can be performed at various architectural levels (logit, representation, vocabulary, or weight space) and is increasingly important in modern language technology for getting the most out of both diverse model assets and heterogeneous pretraining sources.
1. Taxonomy and Levels of Fusion
Fusion for language modeling operates at the following principal levels:
- Probability- or Distribution-Level Fusion: Combines outputs at the logit or probability distribution level (e.g., via log-linear, KL-based, or token-by-token schemes). Examples: Shallow Fusion, Simple Fusion, Cold/Deep Fusion, dynamic/KL-based fusion (Stahlberg et al., 2018, Kurosawa et al., 2019, McDermott et al., 2020, Zouhar et al., 2022, Wan et al., 2024, Gu et al., 20 May 2025, Shi et al., 2024).
- Representation/Embedding-Level Fusion: Joins contextual or semantic representations within or across modalities via concatenation, adapters, gating, or cross-attention. Examples: LSTM/Transformer fusion with prefix embeddings or external semantic/hybrid input (Zouhar et al., 2022, Huang et al., 14 Sep 2025).
- Vocabulary-Level Fusion: Expands the model’s token space to permit early joint reasoning over previously disjoint vocabularies or domains (as in the “OneVocab” approach for DNA-language integration) (Li et al., 21 Jan 2026).
- Parameter-Space Fusion: Merges model weights (parameters) at the tensor or vector level, via averaging, regression, evolutionary algorithms, or Bayesian optimization (e.g., Model Soup, Evolver, BoMF) (Du et al., 2024, Jang et al., 2024).
- Textual/Segment-Level Fusion: Fuses models by reranking or ensembling at the segment or sequence output level, often with no parameter adaptation (“Cool-Fusion” and GFD) (Liu et al., 2024, Hsu et al., 2024).
The following table provides a concise mapping of representative fusion strategies and their operational loci:
| Fusion Method | Fusion Site | Core Operation | Example Paper(s) |
|---|---|---|---|
| Shallow/Log-linear | Logit/Probability | Weighted sum of log-probabilities | (Stahlberg et al., 2018, McDermott et al., 2020) |
| Simple Fusion | Logit/Probability | Additive residual modeling | (Stahlberg et al., 2018) |
| Cold/Deep Fusion | Hidden+Logit | Gated combination of LM/TM reps | (Inaguma et al., 2018, Kurosawa et al., 2019) |
| Memory Attentive | Decoder Multi-hop | Cross-attention multi-hop memory | (Ihori et al., 2020) |
| Distribution Distill | Logit/Probability | KL distillation across LMs | (Wan et al., 2024, Shi et al., 2024) |
| Text/Segment Rerank | Output level | Ensemble via consensus reranking | (Liu et al., 2024, Hsu et al., 2024) |
| Evolutionary Merge | Parameter Space | Mutation/crossover/greedy selection | (Du et al., 2024) |
| Bayesian Opt/Convex | Parameter Space | Multi-objective coefficient opt. | (Jang et al., 2024) |
| Semantic/Feature | Input/Embed | Gated projection of interpretable fea. | (Huang et al., 14 Sep 2025) |
| Early Vocab-Expand | Embedding + Token | Joint DNA-text modeling at token level | (Li et al., 21 Jan 2026) |
| Component/Prompt | Layer-wise/Prompt | Task-adaptive layer selection/fusion | (Si et al., 23 Sep 2025) |
Each fusion locus and mechanism implies distinct tradeoffs in model complexity, flexibility, compatibility, and downstream controllability.
2. Probabilistic and Representation-Level Fusion Mechanisms
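Early and classical methods perform probability-level fusion by combining the log-probabilities or logits from a language model (LM) and a conditional or sequence-to-sequence (seq2seq) model (e.g., NMT, ASR, OCR). In Shallow Fusion, the two models are combined log-linearly when scoring a hypothesis $y$ for an input $x$:
$$\mathrm{score}(y \mid x) = \log P_{\mathrm{TM}}(y \mid x) + \lambda \log P_{\mathrm{LM}}(y)$$
where $P_{\mathrm{TM}}$ is the conditional (seq2seq) model, $P_{\mathrm{LM}}$ is the LM, and $\lambda$ is a task-specific tunable weight (Stahlberg et al., 2018, McDermott et al., 2020, Inaguma et al., 2018).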
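As a minimal sketch of the log-linear scoring described above, the snippet below reranks two hypotheses by adding a weighted LM log-probability to the conditional model's log-probability. The function names, the weight `lam`, and all probabilities are illustrative assumptions, not values from the cited papers.

```python
import math

def shallow_fusion_score(log_p_tm, log_p_lm, lam):
    """Log-linear combination: log P_TM(y|x) + lam * log P_LM(y)."""
    return log_p_tm + lam * log_p_lm

def rerank(candidates, lam=0.3):
    """Return the hypothesis with the highest fused score.

    candidates: list of (text, log_p_tm, log_p_lm) tuples.
    """
    return max(candidates, key=lambda c: shallow_fusion_score(c[1], c[2], lam))

# Hypothetical scores: the conditional model slightly prefers the
# disfluent hypothesis, while the LM strongly prefers the fluent one.
candidates = [
    ("the cat sat", math.log(0.30), math.log(0.20)),
    ("the cat sad", math.log(0.35), math.log(0.01)),
]

print(rerank(candidates, lam=0.3)[0])  # LM weight tips the decision
print(rerank(candidates, lam=0.0)[0])  # lam = 0 ignores the LM entirely
```

In practice the weight would be tuned on a development set, and the same score is usually applied incrementally inside beam search rather than to complete hypotheses.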
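Simple Fusion advances by training the conditional model to produce residual logits relative to the LM, with two variants that differ in where normalization is applied:
- PreNorm: the LM log-probabilities are added to the conditional model's unnormalized scores $S_{\mathrm{TM}}$ before the softmax, i.e. $P(y_t \mid y_{<t}, x) = \mathrm{softmax}\big(S_{\mathrm{TM}}(y_t \mid y_{<t}, x) + \log P_{\mathrm{LM}}(y_t \mid y_{<t})\big)$.
- PostNorm: the conditional model's scores are first normalized with a softmax, and the result is then combined multiplicatively with the LM probabilities.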
Simple Fusion yields strong BLEU gains in low-resource NMT relative to shallow/cold fusion (+0.24 to +2.36 BLEU) (Stahlberg et al., 2018).
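The residual idea can be illustrated with a small sketch of the PreNorm variant, under the reading that LM log-probabilities are added to the conditional model's unnormalized scores before a single softmax. The function names and toy scores are assumptions for illustration only.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def prenorm_fusion(tm_scores, lm_log_probs):
    """Add LM log-probs to unnormalized TM scores, then normalize once."""
    fused = [s + l for s, l in zip(tm_scores, lm_log_probs)]
    return [math.exp(v) for v in log_softmax(fused)]

# Toy three-word vocabulary with hypothetical LM logits.
lm_log_probs = log_softmax([2.0, 1.0, 0.1])

# A zero residual leaves the LM distribution unchanged.
unchanged = prenorm_fusion([0.0, 0.0, 0.0], lm_log_probs)

# A non-zero residual shifts probability mass toward the third word.
shifted = prenorm_fusion([0.0, 0.0, 3.0], lm_log_probs)
print(unchanged, shifted)
```

The zero-residual case shows why the conditional model only has to learn what the LM does not already predict.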
Cold Fusion incorporates a dedicated gating mechanism, fusing hidden representations of the seq2seq decoder and RNN LM before projection to the vocabulary, learning to dynamically control the LM's contribution at each step (Inaguma et al., 2018, Kurosawa et al., 2019).
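A gate of the kind described above can be sketched as follows. This assumes a sigmoid gate computed from the concatenated decoder and LM states and applied element-wise to the LM state; the weight shapes, names, and values are hypothetical, and the subsequent projection to the vocabulary is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gated_fusion(dec_state, lm_state, W_gate, b_gate):
    """Compute g = sigmoid(W [s; h_lm] + b) and return ([s; g * h_lm], g)."""
    concat = dec_state + lm_state
    gate = [sigmoid(z + b) for z, b in zip(matvec(W_gate, concat), b_gate)]
    gated_lm = [g * h for g, h in zip(gate, lm_state)]
    return dec_state + gated_lm, gate

dec_state = [0.5, -1.0]           # hypothetical seq2seq decoder state
lm_state = [1.0, 2.0]             # hypothetical LM hidden state
W_gate = [[0.0] * 4, [0.0] * 4]   # untrained gate: every gate value is 0.5
fused, gate = gated_fusion(dec_state, lm_state, W_gate, [0.0, 0.0])
print(fused, gate)

# A strongly negative bias closes the gate and suppresses the LM state.
closed, _ = gated_fusion(dec_state, lm_state, W_gate, [-20.0, -20.0])
print(closed)
```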
Dynamic Fusion replaces static interpolation with fine-grained, context-sensitive attention over the LM's vocabulary embeddings, allowing the translation model to attend to, weight, and filter LM outputs based on source adequacy and target fluency. The result is fused by concatenation and passed through a post-fusion MLP to produce the output distribution (Kurosawa et al., 2019).
Memory Attentive Fusion introduces, for transformer-based seq2seq, multi-hop cross-attention over an external LM’s hidden-state memory at each decoder block, gating the fused source and LM contexts at each block. Empirically, this configuration outperforms both shallow and cold fusion on text-style conversion BLEU-3 and demonstrates the value of iterative, layer-wise retrieval of LM knowledge (Ihori et al., 2020).
In representation-level fusion, fixed-size semantic or prefix embeddings (e.g., BERT-CLS spans, fuzzy semantic feature vectors) are injected into an RNN or transformer model by concatenation, linear projection, addition, or adaptive gating. Dynamic, per-token gating (e.g., soft-mixture) has shown the best perplexity reductions on standard corpora (Zouhar et al., 2022, Huang et al., 14 Sep 2025).
Semantic fusion with fuzzy-membership features enables highly controllable text generation while simultaneously improving model perplexity and preserving transformer input/output embedding tying; this supports interpretable, user-steerable generation (polarity, punctuation), robust OOD control, and low-overhead integration (Huang et al., 14 Sep 2025).
3. Parameter-Space and Checkpoint Fusion
Model fusion in parameter space aims to merge multiple fine-tuned LMs into a single global model that generalizes across domains or tasks. Principal approaches include:
- Averaging and convex-combination (“model soups”): Uniform or learned convex combination of parameter vectors from model populations; effective in vision but less so in NLU tasks such as GLUE, due to loss–metric surface misalignment (Jang et al., 2024).
- Evolutionary fusion (Evolver): Differential-evolution-inspired mutation/crossover/greedy selection in parameter space. Evolver iteratively produces new model variants by combining weights of parent models, keeping improvements on held-out dev sets. This method is gradient-free and requires only a small dev set for selection; it is empirically shown to outperform Fisher-weighting, regression-based merging (RegMean), and TIES in both accuracy and generalization (Du et al., 2024).
- Multi-objective Bayesian optimization (BoMF): Simultaneous fine-tuning and parameter fusion via Gaussian process surrogates and EHVI. BoMF considers the misalignment between loss and metric surfaces and finds trajectories and combinations that maximize the metric Pareto front. For large PLMs, a two-stage process leverages hyperparameter BO (on a cheap proxy model) and fusion BO on full models. BoMF consistently surpasses standard fine-tuning and SWA, improving GLUE and NLG metrics (Jang et al., 2024).
- NTK-aware clustering (MLP Fusion): Clusters subcomponents of PLM MLP modules to minimize the Adam NTK approximation error, producing compressed architectures with empirically minimal impact on fine-tuning dynamics and downstream performance (Ai et al., 2023).
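Two of the parameter-space approaches listed above can be sketched on flat parameter vectors: a convex combination in the style of model soups, and one generation of a differential-evolution-style search with greedy selection on a development score. This is a toy illustration; the mutation and crossover operators, the hyperparameters `scale` and `cross_rate`, and the stand-in `dev_score` are assumptions and not the exact procedure of Evolver.

```python
import random

def convex_merge(models, coeffs):
    """Weighted average of flat parameter vectors; coeffs must sum to 1."""
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    dim = len(models[0])
    return [sum(c * m[i] for c, m in zip(coeffs, models)) for i in range(dim)]

def evolve_step(population, dev_score, scale=0.5, cross_rate=0.5, rng=random):
    """One generation: mutate, cross over, keep the child only if it is no worse."""
    new_population = []
    for i, parent in enumerate(population):
        others = [p for j, p in enumerate(population) if j != i]
        a, b, c = rng.sample(others, 3)  # needs at least 4 individuals
        mutant = [x + scale * (y - z) for x, y, z in zip(a, b, c)]
        child = [m if rng.random() < cross_rate else p
                 for m, p in zip(mutant, parent)]
        keep_child = dev_score(child) >= dev_score(parent)
        new_population.append(child if keep_child else parent)
    return new_population

print(convex_merge([[0.0, 2.0], [2.0, 4.0]], [0.5, 0.5]))

# Stand-in for dev-set accuracy: negative squared distance to a target vector.
target = [1.0, -2.0, 0.5]

def dev_score(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))

rng = random.Random(0)
population = [[rng.uniform(-3, 3) for _ in target] for _ in range(5)]
initial_best = max(dev_score(p) for p in population)
for _ in range(20):
    population = evolve_step(population, dev_score, rng=rng)
final_best = max(dev_score(p) for p in population)
print(initial_best, final_best)
```

Because selection is greedy, no individual's development score can decrease from one generation to the next, which is what makes the search usable without gradients.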
Distribution-level fusion/distillation fuses the generative knowledge of multiple pre-trained LLMs by aligning their next-token distributions on shared reference corpora via distillation loss (KL divergence), leveraging robust minimum-edit-distance matching for disjoint vocabularies (Wan et al., 2024). This method supports arbitrary architectures/vocabularies and achieves consistent aggregate improvements in reasoning, code, and speed of convergence compared to individual models or naive ensemble/averaging.
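A minimal sketch of the distillation objective follows, assuming the source distributions have already been aligned to a shared vocabulary and are fused by a simple weighted average. The averaging rule, the weights, and the toy distributions are illustrative assumptions; the cited work may select or combine source distributions differently.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def fuse_teachers(teacher_dists, weights):
    """Weighted average of aligned next-token distributions."""
    vocab_size = len(teacher_dists[0])
    return [sum(w * d[i] for w, d in zip(weights, teacher_dists))
            for i in range(vocab_size)]

def fusion_loss(student_dist, teacher_dists, weights):
    """Distillation loss: KL from the fused teacher to the student."""
    return kl_divergence(fuse_teachers(teacher_dists, weights), student_dist)

# Hypothetical next-token distributions over a three-token vocabulary.
teachers = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]
student = [0.5, 0.3, 0.2]

print(fusion_loss(student, teachers, [0.5, 0.5]))  # student matches the average
print(fusion_loss(student, teachers, [1.0, 0.0]))  # student differs from teacher 1
```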
4. Fusion across Heterogeneous Architectures and Modalities
Recent work generalizes fusion beyond single-modality LMs:
- Byte-level fusion: Generative Fusion Decoding (GFD) fuses a cross-modal text recognition model (ASR/OCR) with an LLM by mapping both models' outputs to the byte level and synchronously fusing log-probabilities at each decoding step, enabling seamless plug-and-play for heterogeneous tokenizers and real-time, instruction-aware adaptation (Hsu et al., 2024).
- Vocabulary-level fusion (early integration): In DNA–language modeling, “OneVocab” expands the LM’s vocabulary with k-mers so that genomic and textual tokens are fused at the earliest (embedding) layer. This allows layer-1 self-attention between DNA and text, unlocking fine-grained reasoning unavailable to late (embedding-level) approaches. On genomics, early vocabulary integration outperforms embedding-alignment methods by large margins in both classification and reasoning tasks (Li et al., 21 Jan 2026).
- Prompt-adaptive and component-selective fusion: HarmoniFuse demonstrates component-selective and prompt-conditioned fusion mechanisms in multi-task speech LMs, selecting and fusing specific acoustic and semantic layers via differentiable softmax gates as determined by the prompt or the task—empirically achieving best-in-class results for both ASR and SER in shared architectures (Si et al., 23 Sep 2025).
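The component-selective mechanism in the last item can be illustrated with a softmax gate over layer outputs, where each task or prompt has its own gate logits. This is a sketch under stated assumptions: the task names, logits, and layer vectors are hypothetical, and in a real system the logits would be learned parameters or produced from the prompt.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_outputs, gate_logits):
    """Softmax-weighted sum of per-layer representations."""
    weights = softmax(gate_logits)
    dim = len(layer_outputs[0])
    fused = [sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
             for i in range(dim)]
    return fused, weights

# Hypothetical outputs of three encoder layers for a single frame.
layer_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical per-task gate logits.
task_gates = {"asr": [0.0, 0.0, 0.0], "ser": [4.0, 0.0, 0.0]}

uniform_fused, uniform_weights = fuse_layers(layer_outputs, task_gates["asr"])
ser_fused, ser_weights = fuse_layers(layer_outputs, task_gates["ser"])
print(uniform_fused, ser_weights)
```

Because the gate is differentiable, the choice of which layers to emphasize can be trained jointly with the rest of the model.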
The following table summarizes select cross-modality mechanisms:
| System/Paper | Modality | Fusion Level | Notable Result |
|---|---|---|---|
| GFD (Hsu et al., 2024) | ASR/OCR ↔ LLM | Byte-level | –18.8% WER (ATCO2) |
| HarmoniFuse (Si et al., 23 Sep 2025) | Speech multitask | Layer+Prompt selective | SOTA ASR+SER WER/UA |
| OneVocab (Li et al., 21 Jan 2026) | DNA–language | Token/vocab merge | +6.5% classification |
| Semantic fusion (Huang et al., 14 Sep 2025) | Token semantics | Gated adapter | +4.3% PPL OOD control |
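The vocabulary-level row can be made concrete with a small sketch that appends all DNA k-mers to an existing token-to-id map, so that genomic and text tokens share one id space. The `<dna_...>` naming scheme, the non-overlapping tokenization, and the two-word starting vocabulary are assumptions for illustration and are not taken from the OneVocab paper.

```python
from itertools import product

def kmer_tokens(k):
    """All 4**k DNA k-mers in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def expand_vocab(vocab, k):
    """Return a copy of vocab with one new id per k-mer token."""
    expanded = dict(vocab)
    for kmer in kmer_tokens(k):
        name = f"<dna_{kmer}>"
        if name not in expanded:
            expanded[name] = len(expanded)
    return expanded

def tokenize_dna(sequence, k, vocab):
    """Split into non-overlapping k-mers; a trailing remainder is dropped."""
    return [vocab[f"<dna_{sequence[i:i + k]}>"]
            for i in range(0, len(sequence) - k + 1, k)]

text_vocab = {"the": 0, "gene": 1}         # hypothetical text vocabulary
joint_vocab = expand_vocab(text_vocab, k=3)
dna_ids = tokenize_dna("ACGTAC", 3, joint_vocab)
print(len(joint_vocab), dna_ids)
```

In a real model the embedding matrix would be resized to match, so that the new rows take part in self-attention from the first layer.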
5. Fusion without Training: Inference-Time and Plug-and-Play Paradigms
Newer paradigms bypass re-training or weight merging, relying on either segment-level or sequence-level consensus/reranking:
- Cool-Fusion: Leverages an inference-time segment proposal and reranking workflow. Distinct LLMs independently propose aligned “chunks,” and all K sources score every chunk under their own scoring interface (even under different vocabularies). The chunk with the lowest average perplexity is chosen and the process repeats. This method is architecture-agnostic, trivially parallelizable, and achieves empirical gains—for example, +8%–17% accuracy on GSM8K over the strongest single model (Liu et al., 2024).
- Generative Fusion Decoding: Proceeds at the byte level, aligning heterogeneous inputs (speech/vision+LLM) and sequentially combining prefix likelihoods, each model correcting or suggesting at alternating steps. This plug-and-play paradigm enables prompt-driven adaptation and instruction-aware recognition (Hsu et al., 2024).
Such inference-only fusion approaches are computationally attractive, easily executed across closed-source or disjoint-architecture models, and are especially relevant for system integration in production or cross-API settings.
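One selection step of the chunk-reranking workflow can be sketched as follows, assuming each source model exposes a scoring function that returns per-token log-probabilities for a candidate chunk under its own tokenizer. The toy score tables and chunk strings are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a chunk from its per-token log-probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pick_chunk(candidates, scorers):
    """Choose the candidate chunk with the lowest average perplexity."""
    def average_perplexity(chunk):
        return sum(perplexity(score(chunk)) for score in scorers) / len(scorers)
    return min(candidates, key=average_perplexity)

# Hypothetical score tables; the two models tokenize chunks differently,
# so they return different numbers of per-token log-probabilities.
table_a = {"= 4": [math.log(0.9), math.log(0.8)],
           "= 5": [math.log(0.2), math.log(0.1)]}
table_b = {"= 4": [math.log(0.7)],
           "= 5": [math.log(0.6)]}
scorers = [table_a.__getitem__, table_b.__getitem__]

chosen = pick_chunk(["= 4", "= 5"], scorers)
print(chosen)
```

Averaging perplexities rather than raw log-probabilities keeps scores comparable when the models split the same text into different numbers of tokens.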
6. Preference Optimization, Sequence-Level, and Advanced Distillation Fusion
Fusion is now extending into preference optimization and implicit sequence-level strategies:
- InfiFPO (Implicit Model Fusion via Preference Optimization): Replaces the reference model in DPO with a sequence-level fused distribution, namely the weighted geometric mean of multiple sources' full-sequence probabilities. Key enhancements include probability clipping to ensure correct preference gradients, sequence-level length normalization, and a max-margin winner-take-all fusion heuristic. InfiFPO achieves an average gain of +3.4 points over strong DPO/fusion baselines across math, code, and reasoning benchmarks, and is highly competitive in both efficiency and zero-shot robustness (Gu et al., 20 May 2025).
- Progressive Multi-Mode Fusion: ProFuser employs a two-phase progressive fusion, starting with inference-mode (full generations scored with reward models) and gradually shifting toward training-mode (teacher-forced ground-truth distillation). This easy-to-hard progression ensures that the fused model inherits both the empirical problem-solving power of “best” generations and the token-by-token correctness of standard supervised fine-tuning, outperforming ensemble distillation and state-of-the-art model-weight merging on a diverse battery of benchmarks (Shi et al., 2024).
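The fused-reference idea in the first item can be sketched by computing a weighted geometric mean of source sequence probabilities in log space and plugging it into the standard DPO loss as the reference term. This is a simplified illustration: probability clipping, length normalization, and the max-margin heuristic are omitted, and the weights, `beta`, and probabilities are hypothetical.

```python
import math

def fused_log_prob(source_log_probs, weights):
    """Log of the weighted geometric mean: sum_i w_i * log p_i(y | x)."""
    return sum(w * lp for w, lp in zip(weights, source_log_probs))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, -log sigmoid(margin), on sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))

weights = [0.5, 0.5]
# Hypothetical sequence probabilities from two source models.
ref_chosen = fused_log_prob([math.log(0.25), math.log(0.01)], weights)
ref_rejected = fused_log_prob([math.log(0.04), math.log(0.04)], weights)

# A policy that already favors the chosen response more than the fused reference.
loss = dpo_loss(math.log(0.10), math.log(0.02), ref_chosen, ref_rejected)
print(math.exp(ref_chosen), loss)
```

Note that a geometric mean of distributions is not itself normalized over sequences; in this loss only differences of log-probabilities matter.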
The sequence/segment-level fusion, preference optimization, and multi-fusion distillation regimes reflect recent shifts toward fusing not just “knowledge,” but task-aligned, preference-sensitive, and behavioral aspects of multiple LMs.
7. Theoretical and Practical Considerations; Challenges and Open Directions
Key practical and theoretical insights include:
- Metric–Loss Misalignment: Parameter averaging and SWA are effective in vision due to loss–metric correlation, but in NLP fine-tuning, metric/loss landscapes are poorly aligned. Multi-objective Bayesian optimization is necessary to maximize downstream validation metrics during fusion (Jang et al., 2024).
- Evolutionary Search Robustness: Evolver–style search escapes the local minima of linear averaging and leverages per-domain/expert diversity without gradient computation, yielding superior merging across divergent domains (Du et al., 2024).
- Component/Task Selectivity: Prompt and layer-adaptive fusion mechanisms, such as those in HarmoniFuse, demonstrate that learned selection of which components and representations to fuse is crucial for efficient multi-tasking and minimizing interference (Si et al., 23 Sep 2025).
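- No-Training/Plug-and-Play Viability: Recent methods enable effective fusion without fine-tuning, either by consensus reranking (Cool-Fusion) or by byte-level composition at decoding time (GFD). Sequence-level probability composition (InfiFPO) is related, but it is applied during preference optimization rather than purely at inference time.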
- Future directions: Hybrid architectures combining early and late fusion, multi-stage fusion across parameter, representation, and probability spaces; dynamic or confidence-weighted fusion decisions; efficient scaling to many sources; robustness under quantization; automated fusion strategy selection.
Principal open challenges include scalability to many-source fusion, theory for nonconvex multi-criterion parameter landscapes, efficient hyperparameterization of fusion coefficients, and task-specific fusion where the optimal fusion objective is non-additive, non-scalar, or aligns with complex behavioral metrics.
Fusion for language modeling remains a rapidly evolving meta-technology, fundamental for maximizing the utility of pretrained knowledge, achieving compositional generalization, and robustly recycling heterogeneous language assets in both research and practice.