Amortized Model Ensembling (AME)

Updated 3 July 2026

Amortized Model Ensembling (AME) is a meta-optimization framework that combines ensemble benefits with reduced inference and memory costs.
It uses output-space stochastic methods, such as the Mixture-model-like Ensemble (ME), and parameter-space gradient-based aggregation to synthesize ensemble predictions.
AME generalizes classic model soup techniques, with reported inference speedups in output space and improved out-of-domain robustness in parameter space.

Amortized Model Ensembling (AME) is a meta-optimization paradigm for obtaining the benefits of model ensembling at lower computational or memory cost than conventional pointwise ensemble averaging. AME encompasses both stochastic inference-time ensembling methods, such as the Mixture-model-like Ensemble (ME) for autoregressive LLMs, and parameter-space aggregation strategies, such as gradient-based neural averaging. Both are unified by the principle of amortizing the ensemble computation over time, space, or iterations. AME enables the construction of ensemble-level predictions or solutions at reduced inference or integration cost, and generalizes classic uniform model soup and mixture-of-experts frameworks in both theoretical and practical settings (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).

1. Formalization and Unification Framework

AME is instantiated in two complementary settings: (a) output-space stochastic ensembling, where a sequence of forward passes or samples is amortized across base models (e.g., ME for LLMs), and (b) parameter-space aggregation, where a single set of model weights is synthesized from multiple experts via pseudogradient meta-optimization (data-free neural averaging).

Output-space AME: Mixture-model-like Ensemble (ME)

Given fine-tuned models $M_1, \ldots, M_K$ with next-token distributions $p_k(x_t \mid x_{<t})$, the conventional ensemble distribution is:

$p_{ens}(x_t \mid x_{<t}) = \frac{1}{K} \sum_{k=1}^K p_k(x_t \mid x_{<t}).$

ME instead draws, at each token step, a single model $M_m$ (with probability $\pi_m = 1/K$) and samples $x_t \sim p_m(\cdot \mid x_{<t})$. The marginal next-token distribution under ME is provably equal to $p_{ens}$ (Fu et al., 1 May 2026).
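The two-stage draw described above can be checked numerically. The following is a minimal sketch on toy distributions, not the authors' implementation; the function names `me_marginal` and `me_sample` and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def me_marginal(probs, pi=None):
    """Exact marginal of the two-stage draw: pick model m ~ pi, then x ~ p_m."""
    probs = np.asarray(probs, dtype=float)           # shape (K, V)
    K = probs.shape[0]
    pi = np.full(K, 1.0 / K) if pi is None else np.asarray(pi, dtype=float)
    return pi @ probs                                # sum_m pi_m * p_m(x)

def me_sample(probs, rng):
    """Draw one token by selecting a single model uniformly and sampling from it alone."""
    probs = np.asarray(probs, dtype=float)
    K, V = probs.shape
    m = rng.choice(K)                                # uniform selection, pi_m = 1/K
    return rng.choice(V, p=probs[m])

# Toy next-token distributions for K=3 models over a 4-token vocabulary (made up).
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.10, 0.60, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
p_ens = probs.mean(axis=0)                           # conventional ensemble average

rng = np.random.default_rng(0)
draws = np.array([me_sample(probs, rng) for _ in range(20000)])
empirical = np.bincount(draws, minlength=4) / draws.size
```

Here `me_marginal(probs)` coincides with `p_ens` exactly, and `empirical` approaches it as the number of draws grows, while each draw queries only one model.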

Parameter-space AME: Neural Averaging via Meta-Optimization

Given $N$ pretrained DNNs with weights $x_1, \ldots, x_N \in \mathbb{R}^d$, AME defines “neural averaging” by performing stochastic gradient-based aggregation over a quadratic proxy loss,

$f(x; \xi) = \frac{1}{2}\left(\|x - \xi\|^2 - \|\xi\|^2\right), \quad \xi \sim \mathcal{D},$

with each $x_i$ viewed as an independent draw from the weight distribution $\mathcal{D}$. AME generalizes model soup, optimizer-augmented ensembling, and meta-adaptive aggregation (Lee et al., 20 Aug 2025).
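As a short worked step (standard calculus applied to the proxy loss above, not quoted from the cited paper), differentiating $f$ with respect to $x$ gives

$\nabla_x f(x; \xi) = x - \xi, \qquad \mathbb{E}_{\xi \sim \mathcal{D}}\left[\nabla_x f(x; \xi)\right] = x - \mathbb{E}_{\xi \sim \mathcal{D}}[\xi],$

so the expected proxy loss is minimized at $x^\star = \mathbb{E}_{\xi \sim \mathcal{D}}[\xi]$, and each sampled weight vector $\xi = x_i$ contributes the pseudogradient $x - x_i$. The $-\|\xi\|^2$ term does not depend on $x$ and therefore does not affect the gradient.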

2. Mathematical Guarantees and Equivalence Properties

Output Equivalence (ME)

Sampling each token via ME yields the same marginal distribution as the explicit ensemble:

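$p_{ME}(x_t \mid x_{<t}) = \sum_{m=1}^K \pi_m \, p_m(x_t \mid x_{<t}) = \frac{1}{K} \sum_{k=1}^K p_k(x_t \mid x_{<t}) = p_{ens}(x_t \mid x_{<t}),$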

establishing the strict equivalence of ME’s amortized sampling to full ensemble sampling (Fu et al., 1 May 2026).

Parameter Aggregation Guarantees (AME)

AME admits the following theoretical properties (Lee et al., 20 Aug 2025):

Plain gradient descent ensembling over all $x_i$, with an appropriately chosen learning rate and amplification, recovers uniform model soup: $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$.
More general convex combinations and adaptive soups are achievable via tailored learning rates and pseudogradient scaling.
Under suitable decaying learning rates and boundedness assumptions, AME with AdaGrad or Adam converges, with the ensemble parameter trajectory contained in the convex hull of the ingredients.
Model soup converges in probability to the expected value of $\xi \sim \mathcal{D}$ as the number of ingredients $N$ grows.
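The first property in the list above can be illustrated with a small numerical sketch. The step size $\eta_t = 1/t$ and the single pass over the ingredients are assumptions chosen for this example (the source does not fix them here); with that choice, plain SGD on the quadratic proxy loss reduces to a running mean.

```python
import numpy as np

def sgd_soup(ingredients):
    """One pass of plain SGD on f(x; xi) = 0.5 * (||x - xi||^2 - ||xi||^2).

    The step size eta_t = 1/t is an assumption made for this sketch.
    """
    ingredients = np.asarray(ingredients, dtype=float)   # shape (N, d)
    x = np.zeros(ingredients.shape[1])                   # start is irrelevant since eta_1 = 1
    trajectory = []
    for t, xi in enumerate(ingredients, start=1):
        grad = x - xi                                    # pseudogradient of the proxy loss
        x = x - (1.0 / t) * grad                         # x <- (1 - 1/t) * x + (1/t) * xi
        trajectory.append(x.copy())
    return x, np.array(trajectory)

rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 8))                        # N=5 toy "experts", d=8 (made up)
x_final, traj = sgd_soup(weights)
uniform_soup = weights.mean(axis=0)
```

Each iterate is a convex combination of the ingredients seen so far, so the trajectory stays inside their convex hull and `x_final` equals `uniform_soup` up to floating-point error.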

3. Algorithmic Formulations and Implementation Details

ME for LLMs (Pseudocode)

Lazy KV cache synchronization ensures that each model’s cache is updated only when that model is selected; tokens generated since its last selection are “prefilled” in a batched pass (Fu et al., 1 May 2026).
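The decoding loop with lazy cache synchronization can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's pseudocode: `ToyModel`, its `prefill` and `next_token_probs` methods, and the integer `cached` counter standing in for a KV cache are all hypothetical and do not correspond to any real library API.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for an LLM with a KV cache (not a real library API)."""
    def __init__(self, seed, vocab_size):
        self.table = np.random.default_rng(seed).random((vocab_size, vocab_size))
        self.cached = 0            # number of context tokens already in the "KV cache"
        self.forward_calls = 0     # counts forward passes, for inspection only

    def prefill(self, tokens):
        """One batched pass that absorbs tokens[self.cached:] into the cache."""
        if len(tokens) > self.cached:
            self.forward_calls += 1
            self.cached = len(tokens)

    def next_token_probs(self, tokens):
        """Next-token distribution; this toy rule depends only on the last token."""
        assert self.cached == len(tokens), "cache must be synchronized first"
        row = self.table[tokens[-1]]
        return row / row.sum()

def me_generate(models, prompt, num_tokens, rng):
    tokens = list(prompt)
    for _ in range(num_tokens):
        m = int(rng.integers(len(models)))     # uniform selection, pi_m = 1/K
        model = models[m]
        model.prefill(tokens)                  # lazy sync: only the selected model catches up
        probs = model.next_token_probs(tokens)
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

vocab_size = 6
models = [ToyModel(seed=s, vocab_size=vocab_size) for s in range(3)]
rng = np.random.default_rng(0)
out = me_generate(models, prompt=[0], num_tokens=12, rng=rng)
total_calls = sum(m.forward_calls for m in models)
```

In this toy, exactly one model runs a forward pass per generated token, so `total_calls` equals the number of generated tokens regardless of how many models are in the ensemble. In a real system the catch-up pass processes several tokens at once, so its cost is not constant.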

AME in Parameter Space (Pseudocode)

Optimizer and schedule selection enables adaptation to task, stability, and exploration of the weight simplex (Lee et al., 20 Aug 2025).
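A parameter-space loop of this kind might look like the sketch below, which runs a hand-written Adam update on the pseudogradients $x - x_i$. It is not the paper's pseudocode: the function name `ame_adam`, the hyperparameter values, the $1/\sqrt{t}$ decay, and the shuffled per-epoch visiting order are all assumptions made for illustration.

```python
import numpy as np

def ame_adam(ingredients, x0, epochs=40, lr=0.5, beta1=0.9, beta2=0.999,
             eps=1e-8, seed=0):
    """Aggregate expert weights by running Adam on pseudogradients x - x_i.

    Hyperparameters and the 1/sqrt(t) decay are illustrative assumptions.
    """
    ingredients = np.asarray(ingredients, dtype=float)   # shape (N, d)
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)                                 # first-moment estimate
    v = np.zeros_like(x)                                 # second-moment estimate
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(ingredients)):      # visit each expert once per epoch
            t += 1
            g = x - ingredients[i]                       # gradient of the quadratic proxy loss
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)                 # bias correction
            v_hat = v / (1 - beta2 ** t)
            x = x - (lr / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps)
    return x

rng = np.random.default_rng(1)
experts = rng.normal(size=(5, 4))                        # toy expert weights (made up)
start = experts[0] + 10.0                                # deliberately far-away initialization
merged = ame_adam(experts, start)
soup = experts.mean(axis=0)
```

Swapping the update rule (plain SGD, AdaGrad, Adam) or the schedule changes which point the iterate settles near; the sketch only shows the mechanics and makes no claim about matching the reported results.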

4. Computational Trade-offs and Empirical Results

Inference-Time Speedup (ME)

For a $K$-model ensemble and a $T$-token sequence:

Conventional explicit ensemble: $K \cdot T$ forward passes; $O(K)$ cost per token.
Mixture-model-like ensemble: $T$ forward passes; $O(1)$ cost per token.

Speedups of ME over conventional ensembling are reported on NVIDIA H100, RTX 3090, A100, and V100 for two- and three-model LLM ensembles (Fu et al., 1 May 2026).
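The forward-pass accounting can be written out as a small helper. This is a counting sketch only: it assumes one forward pass per model per token for the conventional ensemble and one forward pass per token for ME, and it ignores the batched catch-up cost of lazy cache synchronization, so the ratio is an idealized count and not a wall-clock prediction. The function name is hypothetical.

```python
def forward_pass_counts(num_models: int, num_tokens: int) -> dict:
    """Idealized decode-time forward-pass counts for a K-model, T-token generation."""
    conventional = num_models * num_tokens   # every model scores every token
    me = num_tokens                          # one randomly selected model per token
    return {"conventional": conventional, "me": me, "ratio": conventional / me}

counts = forward_pass_counts(num_models=3, num_tokens=100)
```

For three models and 100 tokens this gives 300 versus 100 forward passes, an idealized ratio of 3.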

Test Performance Parity

ME statistically matches conventional ensemble (CE) decoding, to within 1 point on each task shown. Example entries:

Task	Best Single	CE (k=5)	ME (k=5)
GSM8K	79.77	83.14	82.97
MMLU	66.75	66.05	65.61
BBH	51.94	52.74	53.04
ARC	81.81	81.14	81.12
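The gaps between the CE and ME columns can be read off the table directly. The snippet below only restates the table's numbers and computes ME minus CE per task; the variable names are arbitrary.

```python
# (CE, ME) scores copied from the table above.
scores = {
    "GSM8K": (83.14, 82.97),
    "MMLU": (66.05, 65.61),
    "BBH": (52.74, 53.04),
    "ARC": (81.14, 81.12),
}

# Positive gap means ME scored higher than CE on that task.
gaps = {task: round(me - ce, 2) for task, (ce, me) in scores.items()}
max_abs_gap = max(abs(g) for g in gaps.values())
```

The largest absolute gap is 0.44 points (MMLU), and ME is ahead of CE on BBH by 0.30 points.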

Parameter-space AME: Out-of-domain and Robustness

AME outperforms both the best single expert and uniform model soup in OOD regimes. For 50% OOD on CIFAR-10, the compared methods are the following (the numeric scores are missing from this copy of the text):

Best Expert
Model Soup
AME (Adam) (Lee et al., 20 Aug 2025).

5. Connections to Routing, Model Soup, and AME Principles

AME provides a unifying perspective over various ensembling, mixture-of-experts, and meta-aggregation methods:

In output space, ME is a special case of token-level routing where routing decisions are made uniformly at random, rather than via a learned selector.
Model soup is realized as a single epoch of gradient descent ensembling in AME; the “pivoted pseudogradients” and adaptivity enable nonconvex or prioritized mixtures.
The cost of ensembling (full $K$-fold computation) is amortized: ME achieves ensemble-level output statistics at only $1/K$ the per-token cost; AME synthesizes a single model representing ensemble knowledge without storing all experts (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).

6. Limitations and Research Directions

Known Limitations

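ME is strictly valid only for random-sample decoding, not deterministic greedy/argmax decoding, where the argmax of the averaged distribution generally differs from the argmax of a randomly selected model: $\arg\max_x p_{ens}(x \mid x_{<t}) \neq \arg\max_x p_m(x \mid x_{<t})$ in general.
Memory cost remains $K \times$ model size; lazy KV cache synchronization can strain memory for large $K$.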
Performance benefits plateau with highly correlated or redundant base models.
In AME, theoretical analyses of hyperparameter-induced “phase transitions,” calibration, and layer/block-wise aggregation remain unresolved (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
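The first limitation in the list above (greedy decoding) can be seen in a two-model toy case. The distributions below are made up for illustration; they are chosen so that the ensemble's most likely token is not the most likely token of either individual model.

```python
import numpy as np

# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
p1 = np.array([0.45, 0.40, 0.15])
p2 = np.array([0.15, 0.40, 0.45])
p_ens = (p1 + p2) / 2                        # averaged distribution

greedy_ensemble = int(np.argmax(p_ens))      # token chosen by greedy decoding on the ensemble
greedy_per_model = {int(np.argmax(p1)), int(np.argmax(p2))}  # tokens a random model would pick
```

Greedy decoding on the averaged distribution selects token 1, while taking the argmax of a randomly chosen model can only ever produce token 0 or token 2, so the sampling-based equivalence does not carry over to argmax decoding.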

Open Directions

Non-uniform weights in ME (by altering $\pi_m$) for quality/efficiency tradeoff.
Routing and learned randomization: merge ME with mixture-of-experts, possibly via differentiable routers that balance latency and accuracy.
Adaptive $K$: dynamically adjust ensemble size based on computational budget.
Greedy and beam heuristics: approximate ME for non-sampling decoders.
In parameter-space AME, extensions include “FedSoup” for federated aggregation, calibration analyses, and neural averaging of inputs or architectural blocks.

7. Applications and Impact

AME underpins substantial new efficiency/quality tradeoffs for large model deployment and federated data-free model synthesis:

For LLMs, ME achieves nearly identical generation quality to conventional ensemble decoding, with a speedup on current GPU hardware.
Parameter-space AME produces robust neural aggregates for OOD generalization, memory preservation, and prototype synthesis, exceeding prior uniform or greedy soup baselines (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).
These methods provide a foundation for scalable, privacy-preserving, and adaptive model sharing or deployment.

AME stands as a mathematically principled, empirically validated, and extensible framework that amortizes ensemble-level performance into practical, computationally tractable algorithms across both weight- and output-space modalities (Fu et al., 1 May 2026, Lee et al., 20 Aug 2025).

References

Fu et al. (2026). Rethinking LLM Ensembling from the Perspective of Mixture Models.

Lee et al. (2025). On Defining Neural Averaging.
