
Rethinking LLM Ensembling from the Perspective of Mixture Models

Published 1 May 2026 in cs.LG and cs.CL | (2605.00419v1)

Abstract: Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to LLMs, yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://github.com/jialefu/Mixture-model-like-Ensemble/.

Summary

  • The paper proposes the Mixture-model-like Ensemble (ME) approach, which samples a single model per step to reproduce ensemble outputs at a fraction of the cost.
  • The paper demonstrates that ME achieves equivalent accuracy to conventional ensembling across benchmarks while providing a 1.78×–2.68× speedup.
  • The paper unifies LLM ensembling with stochastic token-level routing, offering insights into cache synchronization and vocabulary alignment for improved efficiency.

Mixture Model Paradigm for LLM Ensembling

Introduction

The use of model ensembling to enhance prediction reliability and accuracy is established in both classical machine learning and deep learning. However, when applied to LLMs, traditional ensembling—which aggregates output distributions from multiple models—results in substantial computational overhead due to the requirement of multiple forward passes per token. The paper "Rethinking LLM Ensembling from the Perspective of Mixture Models" (2605.00419) revisits this paradigm, introducing the Mixture-model-like Ensemble (ME) approach and positioning LLM ensembling as a special case of mixture models. This reconceptualization yields a theoretical and empirical equivalence with conventional methods, but at a fraction of the computational cost, and provides a formal bridge between ensemble and token-level routing strategies.

Methodology

Conventional Ensemble vs. Mixture-model-like Ensemble

Traditional LLM ensembling aggregates each model's next-token distribution via (possibly weighted) averaging at each decoding step. All constituent models are invoked to produce their distributions; the resulting averaged distribution governs next-token sampling. While this approach leverages model diversity for performance gains, the cost grows linearly with the number of models.
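The conventional step described above can be sketched as follows. This is a minimal illustration with NumPy, not the paper's implementation: the helper name `conventional_ensemble_step` and the toy three-token distributions are assumptions, and in a real system each row of `model_probs` would come from a separate forward pass.

```python
import numpy as np

def conventional_ensemble_step(model_probs, weights, rng):
    """One decoding step of a conventional ensemble (hypothetical helper).

    model_probs: shape (n_models, vocab_size); row i is model i's
    next-token distribution. Producing these rows is what costs
    n forward passes per step.
    """
    model_probs = np.asarray(model_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted average of the per-model distributions.
    ensemble = weights @ model_probs
    # Sample the next token from the averaged distribution.
    token = int(rng.choice(len(ensemble), p=ensemble))
    return token, ensemble

rng = np.random.default_rng(0)
probs = [[0.7, 0.2, 0.1],   # toy distribution for model 0
         [0.1, 0.3, 0.6]]   # toy distribution for model 1
token, ensemble = conventional_ensemble_step(probs, [0.5, 0.5], rng)
print(ensemble)  # averaged distribution: 0.4, 0.25, 0.35
```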

The ME method instead performs a single forward pass at each step: one model is sampled according to the ensemble weights and is solely responsible for sampling that step’s token. Once the choice of model is marginalized out, each token is distributed exactly as under the weighted mixture of the models' distributions, so the output distribution is identical to that of the explicit ensemble. For an ensemble of $n$ models, this reduces the number of forward passes per step from $n$ to one.
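The equivalence rests on one identity: if model $i$ is chosen with probability $w_i$ and then emits token $x$ with probability $p_i(x)$, the marginal probability of $x$ is $\sum_i w_i \, p_i(x)$, which is the ensemble distribution. The sketch below checks this empirically on toy models. The names `me_step` and `toy_models` are illustrative assumptions, and the "models" are fixed distributions rather than real LLMs.

```python
import numpy as np

def me_step(model_fns, weights, context, rng):
    """One ME decoding step (sketch): pick one model, run only that model."""
    i = int(rng.choice(len(model_fns), p=weights))
    probs = model_fns[i](context)          # the only forward pass this step
    token = int(rng.choice(len(probs), p=probs))
    return token, i

# Toy stand-ins for models: fixed distributions that ignore the context.
toy_models = [lambda ctx: np.array([0.7, 0.2, 0.1]),
              lambda ctx: np.array([0.1, 0.3, 0.6])]
weights = np.array([0.5, 0.5])

rng = np.random.default_rng(0)
n_samples = 20_000
counts = np.zeros(3)
for _ in range(n_samples):
    token, _ = me_step(toy_models, weights, None, rng)
    counts[token] += 1

empirical = counts / n_samples
expected = weights @ np.stack([m(None) for m in toy_models])
print("empirical:", empirical)
print("ensemble :", expected)
```

With enough samples the empirical frequencies should approach the ensemble distribution, up to sampling noise.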

KV Cache Synchronization and Heterogeneous Vocabularies

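ME introduces asynchronicity in KV cache management: a model that is not selected at a given step does not process that step's token, so its cache falls behind the generated sequence. Lazy synchronization resolves this. A model's cache is brought up to date only when that model is next selected, by processing all tokens generated since its last selection in one pass, which amortizes the cost over the steps at which it was not selected.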
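A toy bookkeeping sketch of lazy synchronization follows. It assumes a hypothetical `LazyCacheModel` class that only counts tokens and calls; no real model or KV tensors are involved, and the selection sequence is hard-coded rather than sampled.

```python
class LazyCacheModel:
    """Toy stand-in for a model with a KV cache (hypothetical, counts only)."""

    def __init__(self, name):
        self.name = name
        self.cached_len = 0        # tokens whose KV entries are cached
        self.tokens_processed = 0  # total tokens pushed through the model
        self.calls = 0             # number of forward calls

    def forward(self, tokens):
        """Catch up on every token missing from the cache in one call."""
        missing = tokens[self.cached_len:]
        self.tokens_processed += len(missing)
        self.cached_len = len(tokens)
        self.calls += 1
        return len(missing)

def simulate(selections, prompt_len):
    models = [LazyCacheModel("A"), LazyCacheModel("B")]
    tokens = list(range(prompt_len))   # placeholder prompt token ids
    for sel in selections:
        models[sel].forward(tokens)    # only the selected model runs
        tokens.append(len(tokens))     # placeholder for the sampled token
    return models, tokens

models, tokens = simulate([0, 0, 1, 0, 1, 1], prompt_len=4)
for m in models:
    print(m.name, "calls:", m.calls, "tokens processed:", m.tokens_processed)
# A calls: 3 tokens processed: 7
# B calls: 3 tokens processed: 9
```

In this toy run there are six forward calls for six generated tokens, where a conventional two-model ensemble would make twelve. The catch-up calls process more than one token, which is the extra pre-filling work.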

To address vocabulary heterogeneity, which is common among multilingual or domain-adapted LLMs, ME integrates with established vocabulary alignment methods such as UniTe: each model's output distribution is transformed to a unified vocabulary space before sampling.
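To make "a unified vocabulary space" concrete, the sketch below projects two toy distributions onto the union of their vocabularies by exact string match. This is a deliberately simplified assumption and is not UniTe's algorithm; real alignment methods must also handle tokens that one tokenizer splits differently from another. The helper `to_unified` and both vocabularies are hypothetical.

```python
import numpy as np

def to_unified(probs, vocab, unified_index):
    """Map a distribution over `vocab` onto the unified vocabulary.

    Simplifying assumption: tokens are matched by exact string equality.
    """
    out = np.zeros(len(unified_index))
    for token, p in zip(vocab, probs):
        out[unified_index[token]] += p
    return out

vocab_a = ["the", "cat", "sat"]   # hypothetical vocabulary of model A
vocab_b = ["cat", "the", "mat"]   # hypothetical vocabulary of model B

unified = sorted(set(vocab_a) | set(vocab_b))  # ['cat', 'mat', 'sat', 'the']
unified_index = {tok: i for i, tok in enumerate(unified)}

p_a = to_unified([0.5, 0.3, 0.2], vocab_a, unified_index)
p_b = to_unified([0.6, 0.1, 0.3], vocab_b, unified_index)
print(unified)
print(p_a)  # mass for cat, mat, sat, the under model A
print(p_b)  # mass for cat, mat, sat, the under model B
```

Once every model's distribution lives in the same index space, sampling from the selected model yields a token id that all models can interpret.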

Conceptual Unification with Token-level Routing

The stochastic selection in ME reveals LLM ensembling as a restricted form of token-level routing, where the router is non-deterministic and untrained. General routing methods leverage trainable routers to guide token-level model selection, potentially achieving superior performance at increased training cost and system complexity. This unification enables a principled comparison across efficiency, accuracy, and adaptability axes.
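One way to see the unification is to give both approaches the same interface: a router maps the current context to a model index. In the sketch below, the ME router ignores the context and samples from fixed weights, while `make_length_router` is a made-up, rule-based placeholder standing in for a trained, context-aware router. All names are assumptions for illustration.

```python
import numpy as np

def make_me_router(weights):
    """ME as a router: untrained, stochastic, independent of the context."""
    weights = np.asarray(weights, dtype=float)
    def route(context, rng):
        return int(rng.choice(len(weights), p=weights))
    return route

def make_length_router(threshold):
    """Hypothetical context-aware router (placeholder for a trained one):
    model 0 for short contexts, model 1 otherwise."""
    def route(context, rng):
        return 0 if len(context) < threshold else 1
    return route

def decode(router, model_fns, context, steps, rng):
    """Token-level routing loop shared by both kinds of router."""
    context = list(context)
    chosen = []
    for _ in range(steps):
        i = router(context, rng)
        probs = model_fns[i](context)      # one forward pass per step
        context.append(int(rng.choice(len(probs), p=probs)))
        chosen.append(i)
    return context, chosen

# Toy models: fixed distributions over a three-token vocabulary.
toy_models = [lambda ctx: np.array([0.7, 0.2, 0.1]),
              lambda ctx: np.array([0.1, 0.3, 0.6])]

rng = np.random.default_rng(0)
_, me_choices = decode(make_me_router([0.5, 0.5]), toy_models, [0], 4, rng)
_, rule_choices = decode(make_length_router(3), toy_models, [0], 4, rng)
print("ME router choices  :", me_choices)
print("rule router choices:", rule_choices)  # [0, 0, 1, 1]
```

The decoding loop is identical in both cases; only the routing function changes, which is the sense in which ensembling is a special case of routing.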

Experimental Evaluation

Comprehensive experiments span:

  • Ensembles of similar models (identical architectures/different datasets)
  • Ensembles of heterogeneous models (differing architectures, sizes, vocabularies)
  • Ensembles mixing models with varying parameter counts

Benchmarks include GSM8K, MMLU, BBH, and ARC, and evaluations cover both standard and high-throughput hardware (e.g., H100, A100, V100, RTX 3090).

Key empirical results:

  • Performance equivalence: ME consistently matches conventional ensembling (CE) in accuracy on all tasks and model groupings, including two- and three-model settings, and across both homogeneous and heterogeneous ensembles.
  • Speed improvement: ME achieves a 1.78×–2.68× speedup over CE. Its throughput approaches that of single-model inference, with the minor additional overhead attributed to occasional KV cache pre-filling.
  • No guarantee of monotonic gains: Adding more models to an ensemble does not always yield proportional performance enhancements; effectiveness can plateau or worsen due to model compatibility and diversity factors.
  • Effective across devices: Results generalize across diverse hardware; in some settings (e.g., RTX 3090), parallel CE is even slower than sequential CE due to communication bottlenecks, while ME maintains efficiency gains.

When combining models of different sizes, ME allows granular adjustment of the performance/speed trade-off through ensemble weights, though for explicit control, more sophisticated routing is preferable.
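A back-of-the-envelope view of this trade-off: if model $i$ costs $c_i$ per token and is selected with probability $w_i$, the expected per-token cost of ME is $\sum_i w_i c_i$, ignoring cache catch-up. The relative costs below are hypothetical numbers chosen for illustration, not measurements from the paper.

```python
def expected_cost_per_token(weights, costs):
    """Expected per-token cost of ME under the stated simplification."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c for w, c in zip(weights, costs))

costs = [1.0, 4.0]  # hypothetical relative costs: small model, large model

for w_large in (0.0, 0.25, 0.5, 1.0):
    weights = [1.0 - w_large, w_large]
    print(w_large, expected_cost_per_token(weights, costs))
# 0.0 1.0
# 0.25 1.75
# 0.5 2.5
# 1.0 4.0

# A conventional ensemble runs every model at every step.
print("conventional:", sum(costs))  # conventional: 5.0
```

Shifting weight toward the large model raises the expected cost linearly, while the effect on accuracy depends on the models and must be measured.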

Discussion and Implications

Practical Impact

ME renders LLM ensembling practical for latency- and budget-sensitive deployments, eliminating the dominant inference bottleneck with negligible performance compromise. The plug-and-play nature, not requiring further training or significant system modification, enhances its real-world deployability.

Theoretical Perspective

Framing ensembling as stochastic mixture modeling aligns LLM inference practices with established statistical modeling frameworks and multi-expert architectures. The connection to routing suggests research opportunities: ME may serve as a baseline or as a component in more general expert-selection systems, and future work could consider data- or context-aware routers to exceed the random-selection baseline.

Limitations and Extension

The equivalence holds only for stochastic (sampling-based) decoding. It does not extend to greedy or beam search inference, which select tokens deterministically from the explicitly aggregated distribution and therefore still require every model's output at each step. The approach is extensible to other probabilistic combination schemes, but a general characterization of "separable" combination structures is pending further research.
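A small constructed counterexample shows why greedy decoding breaks the equivalence. The two distributions are invented for illustration: the most probable token under the averaged distribution is not the most probable token under either model, so picking one model and taking its argmax can never produce it.

```python
import numpy as np

# Invented next-token distributions over a three-token vocabulary.
p1 = np.array([0.6, 0.4, 0.0])
p2 = np.array([0.0, 0.4, 0.6])
weights = np.array([0.5, 0.5])

# Greedy choice of the explicit ensemble.
mixture = weights @ np.stack([p1, p2])
ensemble_greedy = int(np.argmax(mixture))

# Greedy choices reachable by selecting a single model.
per_model_greedy = {int(np.argmax(p)) for p in (p1, p2)}

print("mixture:", mixture)                     # mixture: [0.3 0.4 0.3]
print("ensemble argmax:", ensemble_greedy)     # ensemble argmax: 1
print("per-model argmax:", sorted(per_model_greedy))  # per-model argmax: [0, 2]
```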

Conclusion

The Mixture-model-like Ensemble approach fundamentally re-examines LLM ensembling, demonstrating that explicit aggregation of model outputs can be replaced with stochastic model selection while preserving the ensemble distribution. This yields significant speed improvements and situates LLM ensembling within the broader context of mixture models and routing methods. These insights not only remove critical practical barriers for LLM ensemble adoption but also open new directions for efficient, collaborative model decoding and expert system design in NLP. Future developments may explore adaptive routing policies and hybrid combination structures, further expanding the frontier of efficient model collaboration.
