Amortized Model Ensembling (AME)
- Amortized Model Ensembling (AME) is a meta-optimization framework that combines ensemble benefits with reduced inference and memory costs.
- It utilizes output-space stochastic methods like Mixture-model-like Ensembles and parameter-space gradient-based aggregation to synthesize ensemble predictions.
- AME achieves significant speedups and robustness by generalizing classic model soup techniques and adapting to diverse computational scenarios.
Amortized Model Ensembling (AME) is a meta-optimization paradigm that retains the benefits of model ensembling while reducing computational or memory cost relative to conventional pointwise ensemble averaging. AME encompasses both stochastic inference-time ensembling methods, such as the Mixture-model-like Ensemble (ME) for autoregressive LLMs, and parameter-space aggregation strategies, such as gradient-based neural averaging. Both are unified by the principle of amortizing the ensemble computation over time, space, or iterations. AME produces ensemble-level predictions or solutions at reduced inference or integration cost, and generalizes classic uniform model soup and mixture-of-experts frameworks in both theoretical and practical settings (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
1. Formalization and Unification Framework
AME is instantiated in two complementary settings: (a) output-space stochastic ensembling, where a sequence of forward passes or samples is amortized across base models (e.g., ME for LLMs), and (b) parameter-space aggregation, where a single set of model weights is synthesized from multiple experts via pseudogradient meta-optimization (data-free neural averaging).
Output-space AME: Mixture-model-like Ensemble (ME)
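Given $k$ fine-tuned models with next-token distributions $p_1, \dots, p_k$, the conventional ensemble distribution is:
$$\bar{p}(x_t \mid x_{<t}) = \frac{1}{k} \sum_{i=1}^{k} p_i(x_t \mid x_{<t})$$
ME instead draws, at each token step, a single model $i$ (with probability $1/k$) and samples $x_t \sim p_i(\cdot \mid x_{<t})$. The marginal next-token distribution under ME is provably equal to $\bar{p}$ (Fu et al., 1 May 2026).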
Parameter-space AME: Neural Averaging via Meta-Optimization
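Given $N$ pretrained DNNs with weights $\theta_1, \dots, \theta_N$, AME defines “neural averaging” by performing stochastic gradient-based aggregation over a quadratic proxy loss,
$$\mathcal{L}(\theta) = \mathbb{E}_{\theta_i \sim \mathcal{D}}\left[\tfrac{1}{2}\,\lVert \theta - \theta_i \rVert_2^2\right],$$
with each $\theta_i$ viewed as an independent draw from the weight distribution $\mathcal{D}$. AME generalizes model soup, optimizer-augmented ensembling, and meta-adaptive aggregation (Lee et al., 20 Aug 2025).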
2. Mathematical Guarantees and Equivalence Properties
Output Equivalence (ME)
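Sampling each token via ME yields the same marginal distribution as the explicit ensemble. With $p_i$ the next-token distribution of the $i$-th of $k$ models, each selected with probability $1/k$, and $\bar{p}$ their uniform average:
$$\Pr_{\mathrm{ME}}(x_t \mid x_{<t}) = \sum_{i=1}^{k} \frac{1}{k}\, p_i(x_t \mid x_{<t}) = \bar{p}(x_t \mid x_{<t}),$$
establishing the strict equivalence of ME’s amortized sampling to full ensemble sampling (Fu et al., 1 May 2026).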
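The equivalence can be checked numerically on a toy example. The sketch below assumes uniform model selection and a three-token vocabulary; the probability table `toy_probs` and both helper functions are illustrative, not taken from the cited work. It compares the exact mixture marginal and the empirical frequencies of two-stage sampling against the explicit ensemble average.

```python
import numpy as np

def me_marginal(model_probs, selection_probs=None):
    """Exact marginal next-token distribution when one model is drawn per step."""
    model_probs = np.asarray(model_probs, dtype=float)  # shape (k, vocab)
    k = model_probs.shape[0]
    if selection_probs is None:
        selection_probs = np.full(k, 1.0 / k)  # uniform selection (assumption)
    return selection_probs @ model_probs

def me_sample_frequencies(model_probs, n_draws, seed=0):
    """Empirical token frequencies from two-stage sampling: pick a model, then a token."""
    rng = np.random.default_rng(seed)
    model_probs = np.asarray(model_probs, dtype=float)
    k, vocab = model_probs.shape
    counts = np.zeros(vocab)
    for _ in range(n_draws):
        i = rng.integers(k)                          # choose one model uniformly
        token = rng.choice(vocab, p=model_probs[i])  # sample from that model only
        counts[token] += 1
    return counts / n_draws

# Hypothetical next-token distributions of three models over a 3-token vocabulary.
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.6, 0.3],
                      [0.2, 0.2, 0.6]])

ensemble = toy_probs.mean(axis=0)                 # explicit ensemble average
exact = me_marginal(toy_probs)                    # mixture marginal, computed exactly
empirical = me_sample_frequencies(toy_probs, n_draws=20000)
```

The exact marginal coincides with the ensemble average by linearity, and the empirical frequencies approach it as the number of draws grows.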
Parameter Aggregation Guarantees (AME)
AME admits the following theoretical properties (Lee et al., 20 Aug 2025):
- Plain gradient descent ensembling, with an appropriate learning rate and amplification factor, over all ingredient weights $\theta_1, \dots, \theta_N$ recovers the uniform model soup $\bar{\theta} = \frac{1}{N} \sum_{i=1}^{N} \theta_i$.
- More general convex combinations and adaptive soups are achievable via tailored learning rates and pseudogradient scaling.
- Under suitably decaying learning rates and the boundedness conditions of the analysis, AME with AdaGrad or Adam converges, with the ensemble parameter trajectory contained in the convex hull of the ingredients.
- Model soup converges in probability to the expected value of the ingredient weights under their distribution.
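The first property can be illustrated with a small numerical sketch. It assumes the pseudogradient for ingredient $\theta_i$ is $\theta - \theta_i$ (the gradient of a squared-distance proxy loss) and uses a $1/t$ learning-rate schedule. That schedule is one choice known to yield a running mean and is an assumption here, not necessarily the setting used by Lee et al. The function `gd_ensemble` and the random ingredients are hypothetical.

```python
import numpy as np

def gd_ensemble(ingredients, lr_schedule):
    """One pass of gradient descent over the ingredients.

    Each step uses the pseudogradient (theta - theta_i), i.e. the gradient of
    0.5 * ||theta - theta_i||^2 (assumed form of the proxy loss).
    """
    theta = np.zeros_like(ingredients[0], dtype=float)
    for t, theta_i in enumerate(ingredients, start=1):
        pseudograd = theta - theta_i
        theta = theta - lr_schedule(t) * pseudograd
    return theta

rng = np.random.default_rng(0)
ingredients = [rng.normal(size=4) for _ in range(5)]  # stand-ins for expert weights

soup = np.mean(ingredients, axis=0)                   # uniform model soup
theta = gd_ensemble(ingredients, lr_schedule=lambda t: 1.0 / t)
```

With a step size in $[0, 1]$, each update is a convex combination of the current iterate and one ingredient, which is why this particular trajectory stays in the convex hull from the first step onward.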
3. Algorithmic Formulations and Implementation Details
ME for LLMs (Pseudocode)
Lazy KV cache synchronization ensures that each model’s cache is updated only when that model is selected; tokens generated since its last selection are “prefilled” in a single batched pass (Fu et al., 1 May 2026).
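A minimal sketch of the decoding loop with lazy cache synchronization follows. `ToyModel` is a hypothetical stand-in for an LLM: it tracks only how many tokens its cache covers and returns a made-up distribution, so the example shows the control flow rather than any real inference API. Uniform per-token model selection is assumed.

```python
import random

class ToyModel:
    """Hypothetical stand-in for an LLM with a KV cache (not a real library API)."""

    def __init__(self, name, vocab_size, seed):
        self.name = name
        self.vocab_size = vocab_size
        self.seed = seed
        self.cached_len = 0      # number of tokens covered by this model's cache
        self.prefill_calls = 0   # number of batched synchronization passes

    def prefill(self, tokens):
        """Absorb tokens this model has not seen yet in one batched pass."""
        if tokens:
            self.cached_len += len(tokens)
            self.prefill_calls += 1

    def next_token_probs(self, context):
        """Return a toy next-token distribution that depends on the context length."""
        rng = random.Random(self.seed * 1_000_003 + len(context))
        weights = [rng.random() + 1e-3 for _ in range(self.vocab_size)]
        total = sum(weights)
        return [w / total for w in weights]

def me_decode(models, prompt, max_new_tokens, seed=0):
    """Sample each token from one uniformly chosen model, syncing its cache lazily."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        model = rng.choice(models)             # uniform selection per token
        missing = tokens[model.cached_len:]    # tokens generated since last selection
        model.prefill(missing)                 # lazy, batched cache synchronization
        probs = model.next_token_probs(tokens)
        token = rng.choices(range(model.vocab_size), weights=probs)[0]
        tokens.append(token)
    return tokens

models = [ToyModel(f"m{i}", vocab_size=8, seed=i) for i in range(3)]
out = me_decode(models, prompt=[1, 2, 3], max_new_tokens=10)
```

Only the selected model does any work at a given step; the others fall behind and catch up in one `prefill` call the next time they are chosen.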
AME in Parameter Space (Pseudocode)
The choice of optimizer and learning-rate schedule allows adaptation to the task, control of stability, and exploration of the weight simplex (Lee et al., 20 Aug 2025).
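As a rough illustration of optimizer-driven aggregation, the sketch below runs a hand-written Adam update on pseudogradients $\theta - \theta_i$, with ingredients sampled uniformly at random and a $1/\sqrt{t}$ learning-rate decay. The pseudogradient form, the schedule, the hyperparameters, and the function name `neural_average_adam` are all assumptions made for this example, not the published algorithm.

```python
import numpy as np

def neural_average_adam(ingredients, steps=5000, lr=0.1,
                        beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Aggregate ingredient weights with Adam applied to pseudogradients.

    Assumed pseudogradient: g = theta - theta_i for a randomly drawn ingredient.
    """
    rng = np.random.default_rng(seed)
    ingredients = np.asarray(ingredients, dtype=float)  # shape (N, d)
    theta = ingredients[0].copy()                       # start from one ingredient
    m = np.zeros_like(theta)                            # first-moment estimate
    v = np.zeros_like(theta)                            # second-moment estimate
    for t in range(1, steps + 1):
        theta_i = ingredients[rng.integers(len(ingredients))]
        g = theta - theta_i
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)                  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        step = lr / np.sqrt(t)                          # decaying learning rate
        theta = theta - step * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Hypothetical two-dimensional "expert weights".
toy_ingredients = np.array([[0.0, 0.0],
                            [1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0],
                            [0.5, 0.5]])

soup = toy_ingredients.mean(axis=0)
theta_adam = neural_average_adam(toy_ingredients)
```

On this symmetric toy set the aggregate settles close to the uniform soup; changing the sampling distribution, the schedule, or the optimizer is how such a procedure would move toward other points of the weight simplex.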
4. Computational Trade-offs and Empirical Results
Inference-Time Speedup (ME)
For a $k$-model ensemble and a $T$-token sequence:
- Conventional explicit ensemble: $kT$ forward passes; $O(k)$ cost per token.
- Mixture-model-like ensemble: $T$ forward passes; $O(1)$ cost per token.
Speedups for ME over conventional ensembling are reported on NVIDIA H100, RTX 3090, A100, and V100 for two- and three-model LLM ensembles (Fu et al., 1 May 2026).
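The forward-pass accounting can be written out as a small helper. It assumes the conventional ensemble runs every one of the k models at every token while ME runs only the selected model, and it counts decoding passes only, ignoring the batched prefill work that lazy cache synchronization adds. The function name and the example sizes are illustrative.

```python
def forward_pass_counts(k, num_tokens):
    """Count decoding forward passes for a k-model ensemble over num_tokens tokens."""
    conventional = k * num_tokens  # every model scores every token
    me = num_tokens                # one selected model decodes each token
    return {
        "conventional": conventional,
        "me": me,
        "ratio": conventional / me,
    }

# Example: three models generating 256 tokens (hypothetical sizes).
counts = forward_pass_counts(k=3, num_tokens=256)
```

The ratio of k is an idealized upper bound on the decoding speedup under these assumptions; measured speedups depend on hardware and synchronization overhead.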
Test Performance Parity
ME statistically matches conventional ensemble decoding, within 1 point on each task listed. Example entries:
| Task | Best Single | Conventional Ensemble, CE (k=5) | ME (k=5) |
|---|---|---|---|
| GSM8K | 79.77 | 83.14 | 82.97 |
| MMLU | 66.75 | 66.05 | 65.61 |
| BBH | 51.94 | 52.74 | 53.04 |
| ARC | 81.81 | 81.14 | 81.12 |
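The gap between ME and the conventional ensemble can be read directly off the table. The short sketch below copies the tabulated scores into a dictionary and computes the per-task difference; the variable names are illustrative.

```python
# Scores copied from the table above.
scores = {
    "GSM8K": {"best_single": 79.77, "ce": 83.14, "me": 82.97},
    "MMLU": {"best_single": 66.75, "ce": 66.05, "me": 65.61},
    "BBH": {"best_single": 51.94, "ce": 52.74, "me": 53.04},
    "ARC": {"best_single": 81.81, "ce": 81.14, "me": 81.12},
}

# ME minus CE per task, rounded to the table's two-decimal precision.
gaps = {task: round(row["me"] - row["ce"], 2) for task, row in scores.items()}

# Largest absolute gap across the listed tasks.
max_gap = max(abs(gap) for gap in gaps.values())
```

The largest absolute gap is 0.44 points (MMLU), and ME is slightly ahead of the conventional ensemble on BBH.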
Parameter-space AME: Out-of-domain and Robustness
AME outperforms both the best single expert and the uniform model soup in OOD regimes. For 50% OOD on CIFAR-10, the reported comparison covers the best expert, the uniform model soup, and AME with Adam, with AME (Adam) ranking highest (Lee et al., 20 Aug 2025).
5. Connections to Routing, Model Soup, and AME Principles
AME provides a unifying perspective over various ensembling, mixture-of-experts, and meta-aggregation methods:
- In output space, ME is a special case of token-level routing in which routing decisions are made uniformly at random rather than by a learned selector.
- Model soup is realized as a single epoch of gradient descent ensembling in AME; the “pivoted pseudogradients” and adaptivity enable nonconvex or prioritized mixtures.
- The cost of ensembling (full $k$-fold computation) is amortized: ME achieves ensemble-level output statistics at only $1/k$ of the per-token cost; AME synthesizes a single model representing ensemble knowledge without storing all experts (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
6. Limitations and Research Directions
Known Limitations
- ME is strictly valid only for random-sample decoding, not deterministic greedy/argmax decoding, where the argmax of the ensemble distribution generally differs from the argmax of a randomly selected model.
- Memory cost remains $k\times$ model size; lazy KV cache synchronization can strain memory for large $k$.
- Performance benefits plateau with highly correlated or redundant base models.
- In AME, theoretical analysis of hyperparameter-induced “phase transitions,” calibration, and layer/block-wise aggregation are unresolved (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
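The greedy-decoding limitation is easy to see on a two-model toy case. In the sketch below (hypothetical probabilities), the token favored by the averaged distribution is not the greedy choice of either individual model, so picking a random model and taking its argmax cannot reproduce ensemble greedy decoding.

```python
import numpy as np

# Hypothetical next-token distributions of two models over three tokens.
model_probs = np.array([[0.5, 0.4, 0.1],   # model 0 greedily picks token 0
                        [0.1, 0.4, 0.5]])  # model 1 greedily picks token 2

# Greedy choice of the explicit ensemble (uniform average of the two rows).
ensemble_greedy = int(np.argmax(model_probs.mean(axis=0)))

# Greedy choice of each model taken on its own.
per_model_greedy = [int(np.argmax(p)) for p in model_probs]
```

Here the ensemble average is (0.3, 0.4, 0.3), so ensemble greedy decoding selects token 1, while the individual models select tokens 0 and 2.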
Open Directions
- Non-uniform weights in ME (by altering the model-selection probabilities) for a quality/efficiency tradeoff.
- Routing and learned randomization: merging ME with mixture-of-experts, possibly via differentiable routers that balance latency and accuracy.
- Adaptive $k$: dynamically adjust the ensemble size based on the computational budget.
- Greedy and beam heuristics: approximate ME for non-sampling decoders.
- In parameter-space AME, extensions include “FedSoup” for federated aggregation, calibration analyses, and neural averaging of inputs or architectural blocks.
7. Applications and Impact
AME underpins substantial new efficiency/quality tradeoffs for large model deployment and federated data-free model synthesis:
- For LLMs, ME achieves nearly identical generation quality to conventional ensemble decoding, with a reported inference speedup on current hardware.
- Parameter-space AME produces robust neural aggregates for OOD generalization, memory preservation, and prototype synthesis, exceeding prior uniform or greedy soup baselines (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
- These methods provide a foundation for scalable, privacy-preserving, and adaptive model sharing or deployment.
AME stands as a mathematically principled, empirically validated, and extensible framework that amortizes ensemble-level performance into practical, computationally tractable algorithms across both weight- and output-space modalities (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).