
Amortized Model Ensembling (AME)

Updated 3 July 2026
  • Amortized Model Ensembling (AME) is a meta-optimization framework that combines ensemble benefits with reduced inference and memory costs.
  • It utilizes output-space stochastic methods like Mixture-model-like Ensembles and parameter-space gradient-based aggregation to synthesize ensemble predictions.
  • AME achieves significant speedups and robustness by generalizing classic model soup techniques and adapting to diverse computational scenarios.

Amortized Model Ensembling (AME) is a meta-optimization paradigm for obtaining the benefits of model ensembling at lower computational or memory cost than conventional pointwise ensemble averaging. AME encompasses both stochastic inference-time ensembling methods, such as the Mixture-model-like Ensemble (ME) for autoregressive LLMs, and parameter-space aggregation strategies, such as gradient-based neural averaging, unified by the principle of amortizing the ensemble computation over time, space, or iterations. AME enables the construction of ensemble-level predictions or solutions at reduced inference or integration cost, and generalizes classic uniform model soup and mixture-of-experts frameworks in both theoretical and practical settings (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).

1. Formalization and Unification Framework

AME is instantiated in two complementary settings: (a) output-space stochastic ensembling, where a sequence of forward passes or samples is amortized across base models (e.g., ME for LLMs), and (b) parameter-space aggregation, where a single set of model weights is synthesized from multiple experts via pseudogradient meta-optimization (data-free neural averaging).

Output-space AME: Mixture-model-like Ensemble (ME)

Given fine-tuned models $M_1, \ldots, M_K$ with next-token distributions $p_k(x_t \mid x_{<t})$, the conventional ensemble distribution is:

$$p_{ens}(x_t \mid x_{<t}) = \frac{1}{K} \sum_{k=1}^K p_k(x_t \mid x_{<t}).$$

ME instead draws, at each token step, a single model $M_m$ (with probability $\pi_m = 1/K$) and samples $x_t \sim p_m(\cdot \mid x_{<t})$. The marginal next-token distribution under ME is provably equal to $p_{ens}$ (Fu et al., 1 May 2026).

Parameter-space AME: Neural Averaging via Meta-Optimization

Given $N$ pretrained DNNs with weights $x_1, \ldots, x_N \in \mathbb{R}^d$, AME defines “neural averaging” by performing stochastic gradient-based aggregation over a quadratic proxy loss,

$$f(x; \xi) = \frac{1}{2}\left(\|x - \xi\|^2 - \|\xi\|^2\right), \quad \xi \sim \mathcal{D},$$

with each $x_i$ viewed as an independent draw from the weight distribution $\mathcal{D}$. AME generalizes model soup, optimizer-augmented ensembling, and meta-adaptive aggregation (Lee et al., 20 Aug 2025).
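As a worked step (derived here from the proxy loss as defined above, not quoted from the cited paper), expanding the squared norm shows why each ingredient acts as the target of a pseudogradient and why the expected loss is minimized at the mean of the weight distribution:

```latex
f(x;\xi) = \tfrac{1}{2}\|x\|^2 - \langle x, \xi \rangle
\quad\Longrightarrow\quad
\nabla_x f(x;\xi) = x - \xi,
\qquad
\mathbb{E}_{\xi \sim \mathcal{D}}\!\left[\nabla_x f(x;\xi)\right] = x - \mathbb{E}_{\xi \sim \mathcal{D}}[\xi] = 0
\;\iff\;
x = \mathbb{E}_{\xi \sim \mathcal{D}}[\xi].
```

The $-\|\xi\|^2$ term only removes a constant that does not depend on $x$, so it leaves the gradient unchanged.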

2. Mathematical Guarantees and Equivalence Properties

Output Equivalence (ME)

Sampling each token via ME yields the same marginal distribution as the explicit ensemble:

$$\sum_{m=1}^{K} \pi_m \, p_m(x_t \mid x_{<t}) = \frac{1}{K} \sum_{k=1}^{K} p_k(x_t \mid x_{<t}) = p_{ens}(x_t \mid x_{<t}),$$

establishing the strict equivalence of ME’s amortized sampling to full ensemble sampling (Fu et al., 1 May 2026).
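The equivalence can be checked numerically on a toy example. The sketch below is illustrative only: the three distributions in `MODEL_DISTS` and the function names are made up for this example, and the "models" are fixed probability vectors rather than LLMs.

```python
import random
from collections import Counter

# Hypothetical next-token distributions for K = 3 models over a 4-token vocabulary.
MODEL_DISTS = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

def ensemble_dist(dists):
    """Explicit ensemble: average the K distributions token by token."""
    k = len(dists)
    return [sum(d[v] for d in dists) / k for v in range(len(dists[0]))]

def me_sample(dists, rng):
    """One ME step: pick a model uniformly (pi_m = 1/K), then sample from it alone."""
    model = rng.randrange(len(dists))
    vocab = range(len(dists[model]))
    return rng.choices(vocab, weights=dists[model], k=1)[0]

def empirical_me_dist(dists, n_samples, seed=0):
    """Estimate the ME marginal by repeated sampling."""
    rng = random.Random(seed)
    counts = Counter(me_sample(dists, rng) for _ in range(n_samples))
    return [counts[v] / n_samples for v in range(len(dists[0]))]
```

With enough samples, `empirical_me_dist(MODEL_DISTS, 200_000)` should agree with `ensemble_dist(MODEL_DISTS)` up to sampling noise, even though each draw evaluates only one model.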

Parameter Aggregation Guarantees (AME)

AME admits the following theoretical properties (Lee et al., 20 Aug 2025):

  • Plain gradient descent ensembling over all ingredients $x_i$, with a suitably chosen learning rate and amplification factor, recovers uniform model soup: $\frac{1}{N} \sum_{i=1}^{N} x_i$.
  • More general convex combinations and adaptive soups are achievable via tailored learning rates and pseudogradient scaling.
  • Under suitably decaying learning rates and a boundedness assumption, AME with AdaGrad or Adam converges with the ensemble parameter trajectory contained in the convex hull of the ingredients.
  • Model soup converges in probability to the expected value of $\xi \sim \mathcal{D}$.
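The first property can be illustrated with a short sketch. It assumes a learning rate of $1/t$ at step $t$ and a single pass over the ingredients; this schedule is an assumption chosen because it provably yields the running mean, and it is not necessarily the setting used in the cited paper. The function names are hypothetical.

```python
def proxy_grad(x, xi):
    """Gradient of the quadratic proxy loss f(x; xi): x - xi."""
    return [a - b for a, b in zip(x, xi)]

def gd_ensemble(ingredients):
    """One pass of plain gradient descent with step size 1/t (assumed schedule)."""
    x = [0.0] * len(ingredients[0])
    for t, xi in enumerate(ingredients, start=1):
        g = proxy_grad(x, xi)
        lr = 1.0 / t
        # x_t = (1 - 1/t) * x_{t-1} + (1/t) * xi, i.e. the running mean.
        x = [a - lr * b for a, b in zip(x, g)]
    return x

def uniform_soup(ingredients):
    """Uniform model soup: the coordinate-wise mean of the ingredient weights."""
    n = len(ingredients)
    return [sum(col) / n for col in zip(*ingredients)]
```

For example, with ingredients `[[1, 2], [3, 6], [5, 1]]` both functions return the same vector up to floating-point rounding, since each update replaces the iterate with the mean of the ingredients seen so far.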

3. Algorithmic Formulations and Implementation Details

ME for LLMs (Pseudocode)

Lazy KV cache synchronization ensures that each model’s cache is updated only when selected; missing tokens since last selection are “prefilled” in a batched pass (Fu et al., 1 May 2026).
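A minimal Python sketch of ME decoding with lazy cache synchronization follows. It is an illustration under stated assumptions, not the authors' implementation: `ToyModel` is a hypothetical stand-in whose "KV cache" is simply the list of tokens it has already processed, and its output distribution is arbitrary.

```python
import random

class ToyModel:
    """Hypothetical stand-in for an LLM; its 'KV cache' is just the cached token list."""
    def __init__(self, bias, vocab_size=4):
        self.bias = bias
        self.vocab_size = vocab_size
        self.cache = []
        self.forward_calls = 0

    def forward(self, new_tokens):
        """Consume uncached tokens in one batched pass; return next-token probabilities."""
        self.cache.extend(new_tokens)
        self.forward_calls += 1
        favored = (self.bias + len(self.cache)) % self.vocab_size
        probs = [0.1] * self.vocab_size
        probs[favored] = 1.0 - 0.1 * (self.vocab_size - 1)
        return probs

def me_generate(models, prompt, n_tokens, seed=0):
    """Generate n_tokens with ME; assumes a non-empty prompt."""
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(n_tokens):
        model = rng.choice(models)         # uniform selection, pi_m = 1/K
        missing = seq[len(model.cache):]   # tokens this model has not cached yet
        probs = model.forward(missing)     # lazy sync: prefill and predict in one pass
        token = rng.choices(range(model.vocab_size), weights=probs, k=1)[0]
        seq.append(token)
    return seq
```

In this sketch exactly one `forward` call is made per generated token, regardless of how many models are in the pool; a model that was not selected for several steps catches up on all missed tokens in a single batched call.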

AME in Parameter Space (Pseudocode)

Optimizer and schedule selection enables adaptation to task, stability, and exploration of the weight simplex (Lee et al., 20 Aug 2025).
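The following is a minimal sketch of parameter-space aggregation with an Adam-style optimizer on the quadratic proxy loss. It makes several assumptions that are not taken from the cited paper: weights are plain Python lists, the iterate is initialized at the first ingredient, the learning rate decays as `lr0 / sqrt(t)`, and the hyperparameter defaults are the common Adam defaults. The function name `adam_ensemble` is hypothetical.

```python
import math

def adam_ensemble(ingredients, epochs=500, lr0=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Aggregate ingredient weights by running Adam on pseudogradients x - xi."""
    dim = len(ingredients[0])
    x = list(ingredients[0])   # assumption: start from the first ingredient
    m = [0.0] * dim            # first-moment estimate
    v = [0.0] * dim            # second-moment estimate
    t = 0
    for _ in range(epochs):
        for xi in ingredients:
            t += 1
            lr = lr0 / math.sqrt(t)   # assumed decaying schedule
            for j in range(dim):
                g = x[j] - xi[j]      # pseudogradient of the proxy loss
                m[j] = beta1 * m[j] + (1 - beta1) * g
                v[j] = beta2 * v[j] + (1 - beta2) * g * g
                m_hat = m[j] / (1 - beta1 ** t)
                v_hat = v[j] / (1 - beta2 ** t)
                x[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

Unlike plain gradient descent with a $1/t$ step, the adaptive update does not reproduce the uniform average exactly; swapping the optimizer or schedule changes which point among the ingredients the aggregate settles near, which is the degree of freedom the text refers to.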

4. Computational Trade-offs and Empirical Results

Inference-Time Speedup (ME)

For a $K$-model ensemble and a $T$-token sequence:

  • Conventional explicit ensemble: $K \cdot T$ forward passes; $O(K)$ cost per token.
  • Mixture-model-like ensemble: $T$ forward passes; $O(1)$ cost per token.

Speedups for ME over conventional ensembling are reported on NVIDIA H100, RTX 3090, A100, and V100 for two- and three-model LLM ensembles (Fu et al., 1 May 2026).
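The pass counts above can be written as a small back-of-the-envelope calculation. This is an idealized sketch with hypothetical function names: it counts sequential forward passes only and does not model the batched prefill work needed for lazy cache synchronization, so it should not be read as a predicted wall-clock speedup.

```python
def conventional_forward_passes(num_models, num_tokens):
    """Every model runs at every decoding step."""
    return num_models * num_tokens

def me_forward_passes(num_models, num_tokens):
    """One randomly selected model runs at each step, whatever the pool size."""
    return num_tokens

def idealized_pass_ratio(num_models, num_tokens):
    """Ratio of forward-pass counts; equals the number of models."""
    return (conventional_forward_passes(num_models, num_tokens)
            / me_forward_passes(num_models, num_tokens))
```

For instance, a three-model ensemble decoding 100 tokens needs 300 forward passes conventionally and 100 under ME in this simplified count.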

Test Performance Parity

ME statistically matches conventional ensemble decoding to within 1 point. Example entries (CE denotes the conventional ensemble):

| Task  | Best Single | CE (k=5) | ME (k=5) |
|-------|-------------|----------|----------|
| GSM8K | 79.77       | 83.14    | 82.97    |
| MMLU  | 66.75       | 66.05    | 65.61    |
| BBH   | 51.94       | 52.74    | 53.04    |
| ARC   | 81.81       | 81.14    | 81.12    |

Parameter-space AME: Out-of-domain and Robustness

AME outperforms both the best single expert and uniform model soup in OOD regimes. For 50% OOD on CIFAR-10, AME with Adam is reported to score above both the best expert and the uniform model soup (Lee et al., 20 Aug 2025).

5. Connections to Routing, Model Soup, and AME Principles

AME provides a unifying perspective over various ensembling, mixture-of-experts, and meta-aggregation methods:

  • In output space, ME is a special case of token-level routing where routing decisions are made uniformly at random rather than by a learned selector.
  • Model soup is realized as a single epoch of gradient descent ensembling in AME; the “pivoted pseudogradients” and adaptivity enable nonconvex or prioritized mixtures.
  • The cost of ensembling (full $K$-fold computation) is amortized: ME achieves ensemble-level output statistics at only $1/K$ the per-token cost; AME synthesizes a single model representing ensemble knowledge without storing all experts (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).

6. Limitations and Research Directions

Known Limitations

  • ME is strictly valid only for random-sample decoding, not deterministic greedy/argmax decoding, where the argmax of the ensemble distribution generally differs from the argmax of a randomly selected model.
  • Memory cost remains $K \times$ model size; lazy KV cache synchronization can strain memory for large $K$.
  • Performance benefits plateau with highly correlated or redundant base models.
  • In AME, theoretical analysis of hyperparameter-induced “phase transitions,” calibration, and layer/block-wise aggregation remains unresolved (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
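The greedy-decoding limitation can be seen in a two-model toy case. The distributions in `GREEDY_DISTS` are constructed for illustration and the helper names are hypothetical: the ensemble's most likely token is one that neither model would pick on its own, so taking the argmax of a randomly selected model can never reproduce the ensemble's greedy choice here.

```python
# Hypothetical distributions over a 3-token vocabulary for two models.
GREEDY_DISTS = [
    [0.50, 0.05, 0.45],   # model A prefers token 0
    [0.05, 0.50, 0.45],   # model B prefers token 1
]

def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def ensemble_argmax(dists):
    """Greedy choice under the explicit ensemble average."""
    k = len(dists)
    avg = [sum(d[v] for d in dists) / k for v in range(len(dists[0]))]
    return argmax(avg)

def per_model_argmax(dists):
    """Greedy choice of each model taken individually."""
    return [argmax(d) for d in dists]
```

Here the averaged distribution is approximately `[0.275, 0.275, 0.45]`, so the ensemble's greedy token is 2, while the individual models choose 0 and 1 respectively.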

Open Directions

  • Non-uniform weights in ME (by altering $\pi_m$) for quality/efficiency tradeoff.
  • Routing and learned randomization: merging ME with mixture-of-experts, possibly via differentiable routers that balance latency and accuracy.
  • Adaptive $K$: dynamically adjust ensemble size based on computational budget.
  • Greedy and beam heuristics: approximate ME for non-sampling decoders.
  • In parameter-space AME, extensions include “FedSoup” for federated aggregation, calibration analyses, and neural averaging of inputs or architectural blocks.

7. Applications and Impact

AME underpins substantial new efficiency/quality tradeoffs for large model deployment and federated data-free model synthesis:

  • For LLMs, ME achieves nearly identical generation quality to conventional ensemble decoding while running faster on current hardware.
  • Parameter-space AME produces robust neural aggregates for OOD generalization, memory preservation, and prototype synthesis, exceeding prior uniform or greedy soup baselines (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
  • These methods provide a foundation for scalable, privacy-preserving, and adaptive model sharing or deployment.

AME stands as a mathematically principled, empirically validated, and extensible framework that amortizes ensemble-level performance into practical, computationally tractable algorithms across both weight- and output-space modalities (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
