
Bootstrapping Intermediate Model Sizes

Updated 16 February 2026
  • Bootstrapping intermediate model sizes refers to a family of methodologies that efficiently generate a range of neural network architectures using weight-sharing, distillation, and Pareto optimization.
  • One approach leverages a super-network-based NAS where a single shared network is optimized by sampling subnetworks with a sandwich rule, dramatically reducing redundant computation.
  • Boomerang distillation enables zero-shot interpolation between teacher and student models, creating intermediate sizes that adapt to varied computational constraints.

Bootstrapping intermediate model sizes refers to methodologies that efficiently generate and select neural network architectures spanning a continuum of parameter counts and computational costs, without requiring independent training for every possible size. This paradigm addresses the need for model families tailored to diverse deployment constraints, shifting from monolithic training of each size toward automated construction, distillation, and interpolation. Recent advances crystallize two primary strategies for bootstrapping intermediate sizes: (1) automated super-network-based neural architecture search frameworks that directly expose a large subnetwork search space, and (2) distillation-based techniques that enable zero-shot interpolation between a compact student and its original teacher. Both approaches minimize redundant compute and accelerate adaptation across memory, bandwidth, and latency regimes (Muñoz et al., 2021; Kangaslahti et al., 6 Oct 2025).

1. Super-Network Generation and Elastic Model Parameterization

The super-network paradigm initializes from a reference model $m$ with layers $\mathcal{L}^m = \{\ell_1, \ldots, \ell_N\}$ and constructs a super-network $\Omega$ preserving topology and weights: $\mathcal{L}^\Omega = \mathcal{L}^m$, $W^\Omega = W^m$. Layers are partitioned into static ($\mathcal{L}^\Omega_s$) and elastic ($\mathcal{L}^\Omega_e$) sets. Elastic width is introduced by discretizing each convolutional layer's channel count $C_i^{\max}$ into a set $W_i = \{C_i^{\max},\, C_i^{\max}-\Delta_i,\, \ldots,\, C_i^{\min}\}$, parameterized via multipliers $\alpha_i \in \mathcal{A}_i \subset (0,1]$ such that $C_i(\alpha_i) = \lfloor \alpha_i C_i^{\max} \rfloor$. For elastic depth, a binary mask $\delta_j \in \{0,1\}$ gates block $j$; the subnetwork depth is $\sum_j \delta_j$.

Any child model $a$ is uniquely defined by the vector $(\alpha, \delta)$ of width/depth choices, yielding a combinatorially large search space $\mathcal{A} = \{a(\alpha, \delta) \mid \alpha_i \in \mathcal{A}_i,\, \delta_j \in \{0,1\}\}$ with $|\mathcal{A}| = \left(\prod_i |\mathcal{A}_i|\right) \cdot 2^K$, where $K$ is the number of depth-gated blocks. Both the minimal subnetwork $a_\mathrm{min}$ and the maximal subnetwork $a_\mathrm{max}$ are explicitly defined, with $a_\mathrm{max}$ constituting the original model $m$ (Muñoz et al., 2021).
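Under these definitions, a child configuration and the size of the search space fit in a few lines of Python. This is a minimal sketch: the multiplier set, layer count, and block count below are illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical elastic configuration following the (alpha, delta)
# parameterization: per-layer width multipliers and binary depth gates.
width_choices = [1.0, 0.75, 0.5, 0.25]   # A_i, assumed shared across elastic layers
num_elastic_layers = 8                   # |L_e^Omega| (illustrative)
num_gated_blocks = 4                     # K (illustrative)

def channels(alpha, c_max):
    """Elastic width: C_i(alpha_i) = floor(alpha_i * C_i^max)."""
    return math.floor(alpha * c_max)

# Search-space size: (prod_i |A_i|) * 2^K
search_space_size = len(width_choices) ** num_elastic_layers * 2 ** num_gated_blocks

# a_max recovers the original model m; a_min is the smallest child.
a_max = ([1.0] * num_elastic_layers, [1] * num_gated_blocks)
a_min = ([min(width_choices)] * num_elastic_layers, [0] * num_gated_blocks)
```

Even this toy setting yields over a million candidate architectures, which is why one-shot weight sharing rather than per-candidate training is essential.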

2. Training Objectives and Efficient One-Shot Optimization

Instead of retraining every candidate model, a single shared-weight super-network $\Omega$ is optimized by sampling subnetworks during each minibatch, minimizing the expected loss:

$$\mathcal{L}_\mathrm{Super}(w) = \mathbb{E}_{\alpha, \delta \sim P_\mathrm{sample}}\!\left[\mathcal{L}_\mathrm{train}(w; \Omega(\alpha, \delta))\right] + \lambda R(w) + \mu\, \mathbb{E}_{\alpha, \delta}\!\left[\mathcal{D}_\mathrm{KD}(\Omega(\alpha, \delta), \text{Teacher})\right]$$

Key components include weight decay $R(w)$, a distillation loss $\mathcal{D}_\mathrm{KD}$ (often from $a_\mathrm{max}$), and a sandwich rule (activate $a_\mathrm{min}$, $a_\mathrm{max}$, plus $N$ random subnetworks per batch). This procedure, adapted from contemporary literature (Yu et al., 2019), enables a single joint training run to furnish weights for all possible intermediate architectures. Empirical settings use batch size 256, base learning rate 0.1, $N = 2$–$4$, and 120 total epochs (Muñoz et al., 2021).
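The sandwich rule's per-minibatch sampling can be sketched as below. The $(\alpha, \delta)$ tuple representation and the layer/block counts are illustrative assumptions; in practice each returned configuration is forward-passed through the shared weights and gradients are accumulated before one optimizer step.

```python
import random

def sandwich_configs(width_choices, n_layers, n_blocks, n_random=2, rng=random):
    """Subnetwork configurations activated in one minibatch under the
    sandwich rule: a_min, a_max, plus n_random randomly sampled children.
    The (alphas, gates) tuple format here is an illustrative assumption."""
    a_max = ([max(width_choices)] * n_layers, [1] * n_blocks)
    a_min = ([min(width_choices)] * n_layers, [0] * n_blocks)
    randoms = [
        ([rng.choice(width_choices) for _ in range(n_layers)],
         [rng.randint(0, 1) for _ in range(n_blocks)])
        for _ in range(n_random)
    ]
    # Gradients from all of these share the same super-network weights.
    return [a_min, a_max] + randoms
```

Anchoring every step with both extremes stabilizes training across the whole width/depth range, while the random children cover the interior of the search space.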

3. Subnetwork Evaluation and Search for Pareto-Optimal Trade-Offs

Each subnetwork $a$ is assessed via:

  • Number of parameters: $\#\text{Parameters}(a) = \sum_{\ell_i \in \mathcal{L}_e^\Omega} C_i(\alpha_i)\, C_{i-1}(\alpha_{i-1})\, k_i^2 + \text{static terms}$,
  • FLOPs: $\text{FLOPs}(a)$, computed analogously at a per-layer level.
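The parameter count under elastic widths can be sketched as follows, assuming a plain chain of convolutions; the channel maxima and kernel sizes are hypothetical, and `static_params` stands in for the non-elastic terms.

```python
import math

def count_parameters(alphas, c_max, kernel_sizes, c_in=3, static_params=0):
    """Parameter count of a child network (sketch): for each elastic conv
    layer, C_i(alpha_i) * C_{i-1}(alpha_{i-1}) * k_i^2 weights, with
    C_i(alpha_i) = floor(alpha_i * C_i^max). Biases/norms are ignored."""
    total = static_params
    prev_c = c_in
    for alpha, cm, k in zip(alphas, c_max, kernel_sizes):
        c = math.floor(alpha * cm)
        total += c * prev_c * k * k
        prev_c = c
    return total
```

Because each term is quadratic in the channel counts, halving the width multipliers shrinks the elastic parameter budget by roughly a factor of four.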

Once super-network training is complete, Pareto-optimal subnetworks are identified via NSGA-II, a multi-objective optimization over accuracy and computational cost (FLOPs). The two objectives, $f_1(a) = -\text{Accuracy}_\mathrm{val}(\Omega(\alpha, \delta); w^*)$ and $f_2(a) = \text{FLOPs}(a)$, are minimized over the discrete architecture variables. Genetic operators (tournament selection, simulated binary crossover with probability 0.9, mutation with probability 0.02) are applied iteratively, yielding a non-dominated frontier $\mathcal{A}_o$ from which deployment candidates are selected (Muñoz et al., 2021).
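The final non-dominated filtering over $(f_1, f_2) = (-\text{accuracy}, \text{FLOPs})$ pairs can be sketched in pure Python. This is only the Pareto-front extraction step, not the full NSGA-II loop, and the candidate values used below are illustrative.

```python
def pareto_front(candidates):
    """Return the non-dominated subset of (f1, f2) objective pairs,
    both minimized (a minimal sketch of the frontier A_o; the genetic
    search that proposes candidates is omitted)."""
    front = []
    for i, (f1, f2) in enumerate(candidates):
        dominated = any(
            (g1 <= f1 and g2 <= f2) and (g1 < f1 or g2 < f2)
            for j, (g1, g2) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((f1, f2))
    return front
```

A candidate survives only if no other point is at least as good on both objectives and strictly better on one; everything else is discarded before deployment selection.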

4. Empirical Results and Practical Guidelines

Evaluations on benchmarks such as CIFAR-10 demonstrate that bootstrapped intermediate models often attain comparable or superior accuracy at significantly reduced computational cost. For example, a ResNet-50 baseline ($a_\mathrm{max}$, 730 MFLOPs, 93.65% accuracy) yields a subnetwork (B-RC, 260 MFLOPs) with 93.70% accuracy: 2.81× fewer FLOPs with a small accuracy gain. Similarly, MobileNetV2 reductions achieve $>2\times$ FLOPs savings with negligible or no accuracy loss.

Fine-tuning selected subnetworks (10 epochs, batch size 128, cosine LR decay from 0.01 to 0.0001, in-place distillation at $T = 4$, $\alpha_{\mathrm{KD}} = 0.1$) routinely recovers an additional 0.1–0.3% accuracy (Muñoz et al., 2021).

Best practices include defining resource budgets in FLOPs or latency, selecting width multipliers $\{1.0, 0.75, 0.5, 0.25\}$ for stability, always employing distillation from $a_\mathrm{max}$, and using the sandwich rule with $N \geq 2$ random subnetworks per batch.

5. Boomerang Distillation: Zero-Shot Model Size Interpolation

Boomerang distillation is an alternative paradigm for LLMs, producing an entire spectrum of intermediate-size models from a single teacher-student pair, with no gradient updates needed for the interpolants (Kangaslahti et al., 6 Oct 2025). The process comprises:

  1. Student Initialization via Layer Dropping: Partition a pretrained $N$-layer transformer $T$ into $M$ contiguous blocks $b^{(1)}, \ldots, b^{(M)}$. The student $S$ is formed by dropping layers (e.g., every other one) and inheriting the surviving blocks from $T$.
  2. Knowledge Distillation with Alignment: $S$ is trained on unlabelled data by minimizing a composite objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\cos}\,\mathcal{L}_{\cos}$$

where $\mathcal{L}_{\mathrm{CE}}$ is next-token cross-entropy, $\mathcal{L}_{\mathrm{KL}}$ is the KL divergence between teacher and student logits at temperature $\tau$, and $\mathcal{L}_{\cos}$ is a per-block cosine distance between student block outputs and the corresponding last-layer teacher block states.

  3. Zero-Shot Patch-Back Interpolation: After distillation, intermediate-sized models are created via "patching": replacing student blocks $s^{(i)}$ with teacher blocks $b^{(i)}$, one at a time, yields interpolants of size $M$ up to $N$. Pseudocode is provided to construct such interpolants for any patch level $k$. No retraining is required.
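The composite distillation objective of step 2 can be sketched in NumPy. The shapes, epsilon constants, and default $\lambda$ weights below are illustrative assumptions (only $\lambda_{\mathrm{KL}} \approx 0.1$ echoes the paper's recommendation), and batched mean reductions stand in for the full training pipeline.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-t softmax along the last axis."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def boomerang_loss(student_logits, teacher_logits, targets,
                   student_states, teacher_states,
                   lam_kl=0.1, lam_cos=0.5, tau=1.0):
    """Sketch of CE + KL + per-block cosine alignment. States are lists
    of (batch, hidden) arrays, one entry per aligned block pair."""
    p_s = softmax(student_logits, tau)
    p_t = softmax(teacher_logits, tau)
    targets = np.asarray(targets)
    # Next-token cross-entropy against hard targets.
    ce = -np.mean(np.log(p_s[np.arange(len(targets)), targets] + 1e-12))
    # KL(teacher || student) on logits at temperature tau.
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1))
    # Mean per-block cosine distance between aligned hidden states.
    cos = 0.0
    for hs, ht in zip(student_states, teacher_states):
        num = np.sum(hs * ht, axis=-1)
        den = np.linalg.norm(hs, axis=-1) * np.linalg.norm(ht, axis=-1) + 1e-12
        cos += np.mean(1.0 - num / den)
    cos = cos / max(len(student_states), 1)
    return ce + lam_kl * kl + lam_cos * cos
```

When student and teacher agree exactly, the KL and cosine terms vanish and the loss reduces to the cross-entropy term.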

This process yields a family of models whose performance increases steadily as more teacher blocks are reinserted, with observed empirical curves tracing a nearly linear path between the distilled student and the teacher. For example, Qwen3-4B's student (2.7B parameters, 52% accuracy) and teacher (4.4B, 68% accuracy) produce interpolants of 3.2B, 3.6B, and 4.0B parameters with 60%, 64%, and 66% accuracy respectively, approximating or exceeding models individually distilled at those sizes (Kangaslahti et al., 6 Oct 2025).
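The patch-back construction itself reduces to list surgery over per-block modules. A minimal sketch with hypothetical block lists, patching from the deepest block backward as in the canonical order:

```python
def patch_back(student_blocks, teacher_blocks, k):
    """Zero-shot patch-back (sketch): build an interpolant at patch
    level k by replacing the k deepest student blocks with their
    aligned teacher blocks. Both inputs are hypothetical length-M
    lists of per-block modules; no retraining is involved."""
    assert len(student_blocks) == len(teacher_blocks)
    m = len(student_blocks)
    assert 0 <= k <= m
    # k = 0 recovers the student; k = M recovers the teacher.
    return student_blocks[: m - k] + teacher_blocks[m - k :]
```

Sweeping $k$ from 0 to $M$ enumerates the full family of interpolants between student and teacher sizes.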

6. Alignment, Practical Guidelines, and Limitations

Boomerang success requires (i) initialization by pruning teacher layers directly, and (ii) inclusion of a per-layer cosine alignment loss. Ablations confirm that omission of alignment disrupts smooth interpolation. Cosine similarity between student and teacher block outputs is predictive of interpolation quality. When teacher layers are misaligned internally, adjusting block boundaries or patching order can mitigate drops.

Recommendations include using $M \approx N/2$ for student capacity, distilling on 1–3B tokens of unlabelled text, $\lambda_{\mathrm{KL}} \approx 0.1$, $\lambda_{\cos} \approx 2/(M+1)$, AdamW with learning rate $3\times 10^{-4}$ and cosine LR scheduling, and typical mini-batch sizes of 2K sequences. The canonical patching order is from the deepest block backward, revisable for better alignment if necessary. Boomerang applies not just to Qwen and Llama but also to off-the-shelf models such as DistilBERT/BERT and DistilGPT2/GPT2, with simple stacking of dropped teacher layers sufficing to generate interpolants (Kangaslahti et al., 6 Oct 2025).

7. Comparative Analysis and Implications

Both super-network-based NAS bootstrapping and boomerang distillation enable efficient construction of a full model size spectrum from a single or few training runs. Super-network approaches are architecture-agnostic and exploit weight-sharing to amortize training cost across billions of subnetwork candidates, facilitating Pareto-optimal selection for arbitrary constraints (Muñoz et al., 2021). Boomerang distillation achieves similarly fine-grained coverage in transformers, with zero-shot construction of new sizes and competitive performance relative to independently trained models (Kangaslahti et al., 6 Oct 2025).

A plausible implication is that these paradigms will render model family construction orders of magnitude more efficient, potentially shifting the norm from per-size pretraining to shared, adaptable infrastructures. Both frameworks rely critically on principled sharing (either of weights or representational alignment) to avoid the inaccuracy typically associated with naive pruning or unaligned knowledge transfer. As deployment environments continue to diversify, such bootstrapping mechanisms will be central to highly adaptive and resource-scalable AI systems.
