
Bootstrapping Intermediate Model Sizes

Updated 16 February 2026
  • Bootstrapping intermediate model sizes refers to a family of methodologies that efficiently generate a range of neural network architectures using weight-sharing, distillation, and Pareto optimization.
  • One approach leverages a super-network-based NAS where a single shared network is optimized by sampling subnetworks with a sandwich rule, dramatically reducing redundant computation.
  • Boomerang distillation enables zero-shot interpolation between teacher and student models, creating intermediate sizes that adapt to varied computational constraints.

Bootstrapping intermediate model sizes refers to methodologies that efficiently generate and select neural network architectures spanning a continuum of parameter counts and computational costs, without requiring independent training for every possible size. This paradigm addresses the need for model families tailored to diverse deployment constraints, shifting from monolithic training of each size toward automated construction, distillation, and interpolation. Recent advances crystallize two primary strategies for bootstrapping intermediate sizes: (1) automated super-network-based neural architecture search frameworks that directly expose a large subnetwork search space, and (2) distillation-based techniques that enable zero-shot interpolation between a compact student and its original teacher. Both approaches minimize redundant compute and accelerate adaptation across memory, bandwidth, and latency regimes (Muñoz et al., 2021; Kangaslahti et al., 6 Oct 2025).

1. Super-Network Generation and Elastic Model Parameterization

The super-network paradigm initializes from a reference model $m$ with layers $\mathcal{L}^m = \{\ell_1, \ldots, \ell_N\}$ and constructs a super-network $\Omega$ preserving topology and weights: $\mathcal{L}^\Omega = \mathcal{L}^m$, $W^\Omega = W^m$. Layers are partitioned into static ($\mathcal{L}^\Omega_s$) and elastic ($\mathcal{L}^\Omega_e$) sets. Elastic width is introduced by discretizing each convolutional layer's channel count $C_i^{\max}$ into a set $W_i = \{C_i^{\max},\, C_i^{\max}-\Delta_i,\, \ldots,\, C_i^{\min}\}$, parameterized via multipliers $\alpha_i \in \mathcal{A}_i \subset (0,1]$ such that $C_i(\alpha_i) = \lfloor \alpha_i C_i^{\max} \rfloor$. For elastic depth, a binary mask $\delta_j \in \{0,1\}$ gates block $j$; the subnetwork depth is $\sum_j \delta_j$.

Any child model $a$ is uniquely defined by the vector $(\alpha, \delta)$ of width/depth choices, yielding a combinatorially large search space $\mathcal{A} = \{a(\alpha, \delta) \mid \alpha_i \in \mathcal{A}_i,\, \delta_j \in \{0,1\}\}$ with $|\mathcal{A}| = \left(\prod_i |\mathcal{A}_i|\right) \cdot 2^K$, where $K$ is the number of depth-gated blocks. Both the minimal subnetwork $a_\mathrm{min}$ and the maximal subnetwork $a_\mathrm{max}$ are explicitly defined, with $a_\mathrm{max}$ constituting the original model $m$ (Muñoz et al., 2021).
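Under these definitions, a child configuration and the size of the search space fit in a few lines of Python. This is a minimal sketch: the multiplier set, layer count, and block count below are illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical elastic configuration following the (alpha, delta)
# parameterization: per-layer width multipliers and binary depth gates.
width_choices = [1.0, 0.75, 0.5, 0.25]   # A_i, assumed shared across elastic layers
num_elastic_layers = 8                   # |L_e^Omega| (illustrative)
num_gated_blocks = 4                     # K (illustrative)

def channels(alpha, c_max):
    """Elastic width: C_i(alpha_i) = floor(alpha_i * C_i^max)."""
    return math.floor(alpha * c_max)

# Search-space size: (prod_i |A_i|) * 2^K
search_space_size = len(width_choices) ** num_elastic_layers * 2 ** num_gated_blocks

# a_max recovers the original model m; a_min is the smallest child.
a_max = ([1.0] * num_elastic_layers, [1] * num_gated_blocks)
a_min = ([min(width_choices)] * num_elastic_layers, [0] * num_gated_blocks)
```

Even this toy setting yields over a million candidate architectures, which is why one-shot weight sharing rather than per-candidate training is essential.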

2. Training Objectives and Efficient One-Shot Optimization

Instead of retraining every candidate model, a single shared-weight super-network $\Omega$ is optimized by sampling subnetworks during each minibatch, minimizing the expected loss:

$$\mathcal{L}_\mathrm{Super}(w) = \mathbb{E}_{\alpha, \delta \sim P_\mathrm{sample}}\!\left[\mathcal{L}_\mathrm{train}(w; \Omega(\alpha, \delta))\right] + \lambda R(w) + \mu\, \mathbb{E}_{\alpha, \delta}\!\left[\mathcal{D}_\mathrm{KD}(\Omega(\alpha, \delta), \text{Teacher})\right]$$

Key components include weight decay $R(w)$, a distillation loss $\mathcal{D}_\mathrm{KD}$ (often from $a_\mathrm{max}$), and a sandwich rule (activate $a_\mathrm{min}$, $a_\mathrm{max}$, plus $N$ random subnetworks per batch). This procedure, adapted from contemporary literature (Yu et al., 2019), enables a single joint training run to furnish weights for all possible intermediate architectures. Empirical settings use batch size 256, base learning rate 0.1, $N = 2$–$4$, and 120 total epochs (Muñoz et al., 2021).
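The sandwich rule's per-minibatch sampling can be sketched as below. The $(\alpha, \delta)$ tuple representation and the layer/block counts are illustrative assumptions; in practice each returned configuration is forward-passed through the shared weights and gradients are accumulated before one optimizer step.

```python
import random

def sandwich_configs(width_choices, n_layers, n_blocks, n_random=2, rng=random):
    """Subnetwork configurations activated in one minibatch under the
    sandwich rule: a_min, a_max, plus n_random randomly sampled children.
    The (alphas, gates) tuple format here is an illustrative assumption."""
    a_max = ([max(width_choices)] * n_layers, [1] * n_blocks)
    a_min = ([min(width_choices)] * n_layers, [0] * n_blocks)
    randoms = [
        ([rng.choice(width_choices) for _ in range(n_layers)],
         [rng.randint(0, 1) for _ in range(n_blocks)])
        for _ in range(n_random)
    ]
    # Gradients from all of these share the same super-network weights.
    return [a_min, a_max] + randoms
```

Anchoring every step with both extremes stabilizes training across the whole width/depth range, while the random children cover the interior of the search space.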

3. Subnetwork Evaluation and Search for Pareto-Optimal Trade-Offs

Each subnetwork $a$ is assessed via:

  • Number of parameters: $\#\text{Parameters}(a) = \sum_{\ell_i \in \mathcal{L}_e^\Omega} C_i(\alpha_i)\, C_{i-1}(\alpha_{i-1})\, k_i^2 + \text{static terms}$,
  • FLOPs: $\text{FLOPs}(a)$, computed analogously at a per-layer level.
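The parameter count under elastic widths can be sketched as follows, assuming a plain chain of convolutions; the channel maxima and kernel sizes are hypothetical, and `static_params` stands in for the non-elastic terms.

```python
import math

def count_parameters(alphas, c_max, kernel_sizes, c_in=3, static_params=0):
    """Parameter count of a child network (sketch): for each elastic conv
    layer, C_i(alpha_i) * C_{i-1}(alpha_{i-1}) * k_i^2 weights, with
    C_i(alpha_i) = floor(alpha_i * C_i^max). Biases/norms are ignored."""
    total = static_params
    prev_c = c_in
    for alpha, cm, k in zip(alphas, c_max, kernel_sizes):
        c = math.floor(alpha * cm)
        total += c * prev_c * k * k
        prev_c = c
    return total
```

Because each term is quadratic in the channel counts, halving the width multipliers shrinks the elastic parameter budget by roughly a factor of four.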

Once super-network training is complete, Pareto-optimal subnetworks are identified via NSGA-II, a multi-objective optimization over accuracy and computational cost (FLOPs). The two objectives, $f_1(a) = -\text{Accuracy}_\mathrm{val}(\Omega(\alpha, \delta); w^*)$ and $f_2(a) = \text{FLOPs}(a)$, are minimized over the discrete architecture variables. Genetic operators (tournament selection, simulated binary crossover with probability 0.9, mutation with probability 0.02) are applied iteratively, yielding a non-dominated frontier $\mathcal{A}_o$ from which deployment candidates are selected (Muñoz et al., 2021).
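The final non-dominated filtering over $(f_1, f_2) = (-\text{accuracy}, \text{FLOPs})$ pairs can be sketched in pure Python. This is only the Pareto-front extraction step, not the full NSGA-II loop, and the candidate values used below are illustrative.

```python
def pareto_front(candidates):
    """Return the non-dominated subset of (f1, f2) objective pairs,
    both minimized (a minimal sketch of the frontier A_o; the genetic
    search that proposes candidates is omitted)."""
    front = []
    for i, (f1, f2) in enumerate(candidates):
        dominated = any(
            (g1 <= f1 and g2 <= f2) and (g1 < f1 or g2 < f2)
            for j, (g1, g2) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((f1, f2))
    return front
```

A candidate survives only if no other point is at least as good on both objectives and strictly better on one; everything else is discarded before deployment selection.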

4. Empirical Results and Practical Guidelines

Evaluations on benchmarks such as CIFAR-10 demonstrate that bootstrapped intermediate models often attain comparable or superior accuracy at significantly reduced computational cost. For example, a ResNet-50 baseline ($a_\mathrm{max}$, 730 MFLOPs, 93.65% accuracy) yields a subnetwork (B-RC, 260 MFLOPs) with 93.70% accuracy: 2.81× fewer FLOPs with a small accuracy gain. Similarly, MobileNetV2 reductions achieve $>2\times$ FLOPs savings with negligible or no accuracy loss.

Fine-tuning selected subnetworks (10 epochs, batch size 128, cosine LR decay from 0.01 to 0.0001, in-place distillation at $T = 4$, $\alpha_{\mathrm{KD}} = 0.1$) routinely recovers an additional 0.1–0.3% accuracy (Muñoz et al., 2021).

Best practices include defining resource budgets in FLOPs or latency, selecting width multipliers $\{1.0, 0.75, 0.5, 0.25\}$ for stability, always employing distillation from $a_\mathrm{max}$, and using the sandwich rule with $N \geq 2$ random subnetworks per batch.

5. Boomerang Distillation: Zero-Shot Model Size Interpolation

Boomerang distillation is an alternative paradigm for LLMs, producing an entire spectrum of intermediate-size models from a single teacher-student pair, with no gradient updates needed for the interpolants (Kangaslahti et al., 6 Oct 2025). The process comprises:

  1. Student Initialization via Layer Dropping: Partition a pretrained $N$-layer transformer $T$ into $M$ contiguous blocks $b^{(1)}, \ldots, b^{(M)}$. The student $S$ is formed by dropping layers (e.g., every other one) and inheriting the surviving blocks from $T$.
  2. Knowledge Distillation with Alignment: $S$ is trained on unlabelled data by minimizing a composite objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\cos}\,\mathcal{L}_{\cos}$$

where $\mathcal{L}_{\mathrm{CE}}$ is next-token cross-entropy, $\mathcal{L}_{\mathrm{KL}}$ is the KL divergence between teacher and student logits at temperature $\tau$, and $\mathcal{L}_{\cos}$ is a per-block cosine distance between student block outputs and the corresponding last-layer teacher block states.

  3. Zero-Shot Patch-Back Interpolation: After distillation, intermediate-sized models are created via "patching": replacing student blocks $s^{(i)}$ with teacher blocks $b^{(i)}$, one at a time, yields interpolants of size $M$ up to $N$. Pseudocode is provided to construct such interpolants for any patch level $k$. No retraining is required.
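The composite distillation objective of step 2 can be sketched in NumPy. The shapes, epsilon constants, and default $\lambda$ weights below are illustrative assumptions (only $\lambda_{\mathrm{KL}} \approx 0.1$ echoes the paper's recommendation), and batched mean reductions stand in for the full training pipeline.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-t softmax along the last axis."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def boomerang_loss(student_logits, teacher_logits, targets,
                   student_states, teacher_states,
                   lam_kl=0.1, lam_cos=0.5, tau=1.0):
    """Sketch of CE + KL + per-block cosine alignment. States are lists
    of (batch, hidden) arrays, one entry per aligned block pair."""
    p_s = softmax(student_logits, tau)
    p_t = softmax(teacher_logits, tau)
    targets = np.asarray(targets)
    # Next-token cross-entropy against hard targets.
    ce = -np.mean(np.log(p_s[np.arange(len(targets)), targets] + 1e-12))
    # KL(teacher || student) on logits at temperature tau.
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1))
    # Mean per-block cosine distance between aligned hidden states.
    cos = 0.0
    for hs, ht in zip(student_states, teacher_states):
        num = np.sum(hs * ht, axis=-1)
        den = np.linalg.norm(hs, axis=-1) * np.linalg.norm(ht, axis=-1) + 1e-12
        cos += np.mean(1.0 - num / den)
    cos = cos / max(len(student_states), 1)
    return ce + lam_kl * kl + lam_cos * cos
```

When student and teacher agree exactly, the KL and cosine terms vanish and the loss reduces to the cross-entropy term.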

This process yields a family of models whose performance increases steadily as more teacher blocks are reinserted, with observed empirical curves tracing a nearly linear path between the distilled student and the teacher. For example, Qwen3-4B's student (2.7B parameters, 52% accuracy) and teacher (4.4B, 68% accuracy) produce interpolants of 3.2B, 3.6B, and 4.0B parameters with 60%, 64%, and 66% accuracy respectively, approximating or exceeding models individually distilled at those sizes (Kangaslahti et al., 6 Oct 2025).
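The patch-back construction itself reduces to list surgery over per-block modules. A minimal sketch with hypothetical block lists, patching from the deepest block backward as in the canonical order:

```python
def patch_back(student_blocks, teacher_blocks, k):
    """Zero-shot patch-back (sketch): build an interpolant at patch
    level k by replacing the k deepest student blocks with their
    aligned teacher blocks. Both inputs are hypothetical length-M
    lists of per-block modules; no retraining is involved."""
    assert len(student_blocks) == len(teacher_blocks)
    m = len(student_blocks)
    assert 0 <= k <= m
    # k = 0 recovers the student; k = M recovers the teacher.
    return student_blocks[: m - k] + teacher_blocks[m - k :]
```

Sweeping $k$ from 0 to $M$ enumerates the full family of interpolants between student and teacher sizes.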

6. Alignment, Practical Guidelines, and Limitations

Boomerang success requires (i) initialization by pruning teacher layers directly, and (ii) inclusion of a per-layer cosine alignment loss. Ablations confirm that omission of alignment disrupts smooth interpolation. Cosine similarity between student and teacher block outputs is predictive of interpolation quality. When teacher layers are misaligned internally, adjusting block boundaries or patching order can mitigate drops.

Recommendations include using $M \approx N/2$ for student capacity, distilling on 1–3B tokens of unlabelled text, $\lambda_{\mathrm{KL}} \approx 0.1$, $\lambda_{\cos} \approx 2/(M+1)$, AdamW with learning rate $3\times 10^{-4}$ and cosine LR scheduling, and typical mini-batch sizes of 2K sequences. The canonical patching order is from the deepest block backward, revisable for better alignment if necessary. Boomerang applies not just to Qwen and Llama but also to off-the-shelf models such as DistilBERT/BERT and DistilGPT2/GPT2, with simple stacking of dropped teacher layers sufficing to generate interpolants (Kangaslahti et al., 6 Oct 2025).

7. Comparative Analysis and Implications

Both super-network-based NAS bootstrapping and boomerang distillation enable efficient construction of a full model size spectrum from a single or few training runs. Super-network approaches are architecture-agnostic and exploit weight-sharing to amortize training cost across billions of subnetwork candidates, facilitating Pareto-optimal selection for arbitrary constraints (Muñoz et al., 2021). Boomerang distillation achieves similarly fine-grained coverage in transformers, with zero-shot construction of new sizes and competitive performance relative to independently trained models (Kangaslahti et al., 6 Oct 2025).

A plausible implication is that these paradigms will render model family construction orders of magnitude more efficient, potentially shifting the norm from per-size pretraining to shared, adaptable infrastructures. Both frameworks rely critically on principled sharing (either of weights or representational alignment) to avoid the inaccuracy typically associated with naive pruning or unaligned knowledge transfer. As deployment environments continue to diversify, such bootstrapping mechanisms will be central to highly adaptive and resource-scalable AI systems.
