
Pruning Before Training (PBT) in Neural Networks

Updated 24 February 2026
  • PBT is a neural network compression strategy that embeds pruning into the model design phase instead of applying it post-training.
  • It leverages optimization techniques, distribution sampling, and NAS methods to directly navigate the accuracy–efficiency Pareto frontier.
  • Empirical studies show that PBT can significantly reduce FLOPs and model size while maintaining high accuracy across different architectures.

Pruning Before Training (PBT) describes a class of neural network compression strategies in which pruning—the removal of weights, channels, blocks, or other substructures—is integrated directly into the network design phase, often before or during the main phase of supervised training. While conventional pruning protocols typically excise parameters after a baseline model is trained, PBT leverages search and constraint optimization to select efficient sub-architectures early in the pipeline. PBT approaches are often, but not exclusively, intertwined with Neural Architecture Search (NAS), yielding direct exploration of the accuracy/efficiency Pareto frontier for compact, hardware-conscious deployment.

1. Conceptual Landscape and Taxonomy

PBT encompasses a spectrum of methodologies unified by the notion that pruning decisions are not delayed until post hoc fine-tuning, but rather made as an implicit or explicit outcome of the model construction, parameter initialization, or supernet training regime. Distinct subcategories include:

  • NAS-integrated PBT: Architecture search explicitly includes per-layer sparsity, channel width, layer depth, or operator selection as first-class variables, with pruning embedded in or co-optimized with search (Zheng et al., 2019, Dai et al., 2020, Li et al., 2022).
  • Distribution- and proxy-driven PBT: Progressive sampling, probabilistic gating, or gradient-based proxy objectives identify low-utility structures for early removal (Zheng et al., 2019, Malettira et al., 2 Feb 2026, Li et al., 2020).
  • Weight-sharing and supernet-based PBT: Single-shot or two-stage supernet training, where subnetworks (with varying levels of pruning) are sampled and trained jointly, so only the most promising subnets are retained for downstream deployment (Klein et al., 2024, Abebe et al., 15 Jan 2025, Sarah et al., 2024).
  • Differentiable and gradient-driven PBT: Latent variable or gating schemes (e.g., Gumbel-Softmax, ℓ₁-proximal optimization) directly encode pruning as part of the optimization landscape (Li et al., 2020, Li et al., 2022).
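
The gating schemes mentioned above can be made concrete with a minimal sketch of a Gumbel-Softmax relaxed keep/prune decision for a single channel. This is a generic illustration, not the implementation of any cited method; the logits, temperature, and names are made up:

```python
import numpy as np

def gumbel_softmax_gate(logits, tau, rng):
    """Differentiable relaxation of a categorical keep/prune decision."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    return y / y.sum()                                    # soft one-hot over {keep, prune}

rng = np.random.default_rng(0)
probs = gumbel_softmax_gate(np.array([2.0, -2.0]), tau=0.5, rng=rng)
keep_weight = probs[0]  # scales the channel output during training
```

Because the relaxation is differentiable in the logits, the keep/prune preference can be trained by gradient descent alongside the network weights, then discretized at deployment.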

This conception departs from classical “post-training” magnitude or sensitivity-based pruning, where weights are trimmed based on activity measured after full model convergence.

2. Formal Problem Statement and Search Space Construction

The formal objective of PBT is typically a bilevel or multi-objective optimization over architectural and pruning variables, subject to resource budgets:

min_{A, θ}  L_train(f(x; θ, A))    subject to    FLOPs(A) ≤ B,    Params(A) ≤ S_max

Here, A parameterizes the architectural/pruning choices—such as binary masks, per-layer channel widths, block retention masks, or latent vectors controlling hypernetworks—and θ the network weights. In the NAS context, A may combine operator selection per edge in a cell graph (Zheng et al., 2019, Laube et al., 2019) with layerwise/channelwise sparsity (Li et al., 2022, Lin et al., 2021).
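
To make the constraint concrete, the following toy sketch treats A as binary channel masks over a chain of convolutions and checks a FLOPs estimate against the budget B. The layer sizes, masks, and budget are illustrative, not taken from any cited paper:

```python
import numpy as np

def conv_flops(in_ch, out_ch, k, h, w):
    """Approximate multiply-accumulate count of a k x k convolution."""
    return in_ch * out_ch * k * k * h * w

def masked_flops(masks, k=3, h=32, w=32, in_ch=3):
    """FLOPs of a conv chain where masks[l] selects retained out-channels."""
    total, prev = 0, in_ch
    for m in masks:
        kept = int(m.sum())                # surviving channels in this layer
        total += conv_flops(prev, kept, k, h, w)
        prev = kept                        # next layer sees the pruned width
    return total

# Two layers with 64 and 128 channels, half of each pruned by the mask.
masks = [np.array([1, 0] * 32), np.array([1, 0] * 64)]
budget = 1e9
feasible = masked_flops(masks) <= budget   # the constraint FLOPs(A) <= B
```

A search procedure would treat `masks` as the decision variable, rejecting or penalizing candidates for which `feasible` is false.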

Table 1 summarizes representative PBT-related search spaces:

| Paper | Architecture/Pruning Variables | Resource Constraints |
| --- | --- | --- |
| DDPNAS (Zheng et al., 2019) | Discrete op selection per edge in DAG | FLOPs, latency |
| TAS (Dong et al., 2019) | Per-layer channel/depth multinomials | Expected FLOPs penalty |
| PaS (Li et al., 2022) | Binary gates per channel | Target MACs (hard/soft) |
| DHP (Li et al., 2020) | Latent vectors {z^l}, input/output | Target FLOPs |
| Joint-DetNAS (Yao et al., 2021) | Stage/block width/depth, channel masks | GFLOPs |

Methods differ in whether the architectural variables are discrete, continuous, probabilistic, differentiable, or updated by evolutionary, RL, or gradient-based strategies.

3. Principal Algorithms and Optimization Strategies

Several optimization approaches are prevalent in PBT:

a) Progressive distribution pruning: Methods such as DDPNAS (Zheng et al., 2019) maintain a joint categorical distribution over architectural/pruning choices, prune low-probability options after each round, and update probabilities using performance-based utilities with momentum. The search proceeds until only one candidate per slot remains; a final exhaustive scan yields the optimal subnetwork under constraint.

b) Prune-and-replace and morph-based search: PR-DARTS (Laube et al., 2019) alternates between (i) pruning poor operators from the candidate pool on each cell edge and (ii) introducing new operations via network morphisms, followed by retraining.

c) Differentiable pruning via proxies/latent variables: DHP (Li et al., 2020) utilizes hypernetworks parameterized by sparse latent vectors {z^l}, optimized jointly with network weights under an ℓ₁ penalty. Proximal gradient steps induce channel-level sparsity. PaS (Li et al., 2022) employs gating vectors updated through straight-through estimators (STE) with hard or soft resource constraints.
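
The soft-thresholding proximal operator behind such ℓ₁-driven sparsification can be sketched as follows. This is a generic illustration of the proximal step, not DHP's actual hypernetwork implementation; the vectors and step sizes are made up:

```python
import numpy as np

def prox_l1(z, lam):
    """Proximal operator of lam*||.||_1: soft thresholding toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([0.9, 0.05, -0.4, 0.01])          # latent vector for one layer
grad = np.array([0.1, 0.0, -0.2, 0.0])         # gradient of the training loss
z = prox_l1(z - 0.1 * grad, lam=0.1)           # gradient step, then prox step
pruned = np.isclose(z, 0.0)                    # zeroed entries mark channels to drop
```

Entries driven exactly to zero by the threshold identify prunable channels, which is what makes the proximal formulation attractive compared to plain ℓ₁ penalties that only shrink weights.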

d) Supernet and sandwich-training: LLaMA-NAS (Sarah et al., 2024), SuperSAM (Abebe et al., 15 Jan 2025), and WS-NAS (Klein et al., 2024) train a supernetwork embedding all possible pruned configurations, leveraging weight-sharing. The NAS search (evolutionary, random, or Bayesian) identifies Pareto-optimal subnetworks post hoc, many of which were never explicitly trained as standalone models.
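
The weight-sharing idea can be illustrated with a toy slimmable-style example in which every subnetwork of width w reuses the leading columns of one shared weight matrix, and a "sandwich" step updates the largest, smallest, and one random width together. This is purely illustrative; the cited methods operate on full transformer supernets:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(8, 64))                  # shared supernet weights
x, y = rng.normal(size=(4, 8)), rng.normal(size=(4, 64))

def loss_and_grad(w):
    """Squared-error loss and its gradient for the width-w subnetwork."""
    pred = x @ W[:, :w]                       # forward pass uses a shared slice
    resid = pred - y[:, :w]
    return (resid ** 2).mean(), x.T @ resid / len(x)

# Sandwich step: max, min, and one random width share one weight update.
for w in (64, 16, int(rng.choice([24, 32, 48]))):
    _, g = loss_and_grad(w)
    W[:, :w] -= 0.01 * g                      # gradient lands on shared weights
```

Because every sampled width trains the same underlying tensor, subnetworks that were never trained standalone still inherit usable weights at extraction time.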

e) Reinforcement learning or Bayesian meta-modeling: NPAS (Li et al., 2020) employs meta-controllers (Q-learning, BO) that select filter type, per-layer pruning scheme, and pruning rate. Real-world latency is measured per candidate to align search with deployment constraints.

f) Zero-shot functional proxies: TraceNAS (Malettira et al., 2 Feb 2026) introduces a gradient trace correlation proxy: for each subnetwork, similarity to the pre-trained model’s loss landscape is measured via blockwise Pearson correlation of low-rank gradient traces. This allows candidate evaluation without full fine-tuning.
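
A generic sketch of such a correlation-based proxy follows; TraceNAS's actual blockwise low-rank trace construction differs in detail, and the traces here are synthetic stand-ins:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two flattened gradient traces."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def proxy_score(ref_traces, cand_traces):
    """Average blockwise correlation; higher = closer loss landscape."""
    return np.mean([pearson(r, c) for r, c in zip(ref_traces, cand_traces)])

rng = np.random.default_rng(1)
ref = [rng.normal(size=32) for _ in range(4)]            # pretrained model's traces
noisy = [r + 0.1 * rng.normal(size=32) for r in ref]     # mildly perturbed candidate
unrelated = [rng.normal(size=32) for _ in range(4)]      # dissimilar candidate
better = proxy_score(ref, noisy) > proxy_score(ref, unrelated)
```

The appeal is that ranking candidates by such a score requires only gradient evaluations, not full fine-tuning of each subnetwork.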

4. Practical Pipelines and Implementation Patterns

The practical instantiations of PBT frequently exhibit the following steps:

  1. Supernet or broad model preparation: Initialize a large, overparameterized model or construct a supernet with architectural elasticity.
  2. Search/optimization loop:
    • Sample or evaluate candidate subnetworks (via pruning variables)
    • Score via proxy (validation, predicted, or zero-shot metric)
    • Update pruning/architectural variables (distribution shift, evolution, gradient steps)
    • Apply hard pruning/removal after scheduled steps or as dictated by criterion
  3. (Optional) Final selection & post-processing: Extract optimal or Pareto-subnetworks; often includes final fine-tuning or continued pretraining with frozen pruning decisions.
  4. Deployment-oriented steps: Structural re-parameterization (PaS, DHP), code generation for hardware (NPAS), or program autotuning (SuperSAM).

A high-level algorithmic outline (DDPNAS (Zheng et al., 2019)):

theta = uniform_distribution()                 # architecture/pruning distribution
while any(len(candidates[e]) > 1 for e in edges):
    for _ in range(T):
        arch = sample(theta)                   # sample a candidate subnetwork
        record_utility(arch, validate(arch))   # evaluate on validation data
    theta = update(theta, utilities)           # softmax-utility update with momentum
    prune_worst_candidates()                   # drop the lowest-probability option per edge

5. Empirical Results and Comparative Insights

PBT approaches consistently demonstrate that integrating pruning into the search or training pipeline yields superior accuracy–efficiency trade-offs:

  • Model quality at fixed cost: On ImageNet, AACP (Lin et al., 2021) achieves a 42% FLOP reduction in ResNet-50 with a minimal (0.18%) drop in top-1 accuracy, strictly outperforming contemporaneous NAS-based pruning and magnitude-based methods.
  • Search efficiency: DDPNAS discovers ImageNet-mobile architectures in 1.8 GPU-hours (Zheng et al., 2019); DA-NAS attains a 2× speedup on “Shuffle+Mobile” search versus prior baselines (Dai et al., 2020).
  • LLM and Transformer scale: WS-NAS (Klein et al., 2024) finds 2× compressed BERT/RoBERTa variants with less than 1 point of accuracy loss, outperforming fixed-head/quantized baselines. TraceNAS (Malettira et al., 2 Feb 2026) matches or exceeds training-aware pruning at one-tenth the GPU cost, realizing 4× throughput gains after search.
  • Detection, segmentation, generative tasks: Joint-DetNAS (Yao et al., 2021) delivers +5 mAP at fixed FLOPs over “prune-only” pipelines. DHP and PaS demonstrate compressive gains on classification, SISR, denoising, and instance segmentation with maintained or improved accuracy (Li et al., 2020, Li et al., 2022).

Comparison of pruning schemes, NAS-only, and joint approaches reveals that hybrid NAS+prune frameworks strictly dominate simple post-training pruning on the accuracy–cost frontier (Yao et al., 2021).

6. Practical Considerations, Limitations, and Extensions

Key practical insights for adopting PBT include:

  • Constraint handling: IDE (AACP) and PaS demonstrate that hard FLOPs/parameter budgets can be enforced through vector quantization/projection (Lin et al., 2021, Li et al., 2022); Lagrangian relaxations can also be used for soft constraints.
  • Batchnorm and representation recalibration: PBT pipelines often include explicit BN recalibration to improve subnetwork validity post-pruning (Lin et al., 2021).
  • Search complexity: Search spaces are exponentially large in depth or per-layer width (noted as a limitation in (Dong et al., 2019, Sarah et al., 2024)); per-layer step vectors, or compressed representations (AACP, TAS), address tractability.
  • Proxy reliability: The fidelity of analytic or zero-shot proxies (L₁-norm, gradient correlation) is domain dependent; some methods may fail in regimes with uncooperative filter dynamics (Lin et al., 2021, Malettira et al., 2 Feb 2026).
  • Hardware and deployment: Integrated compiler- or autotuner-based code generation, as in NPAS (Li et al., 2020) or SuperSAM (Abebe et al., 15 Jan 2025), is necessary to translate theoretical compressions into real-world speedups.
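
A soft resource constraint of the kind mentioned in the first bullet can be sketched as a simple penalty term added to the task loss. This is an illustrative Lagrangian-style relaxation, not the exact formulation of AACP or PaS; the multiplier and budget are made up:

```python
def penalized_loss(task_loss, flops, budget, lam=1.0):
    """Task loss plus a penalty proportional to the relative FLOPs overshoot."""
    overshoot = max(0.0, flops - budget) / budget   # zero when under budget
    return task_loss + lam * overshoot

under = penalized_loss(0.5, 9e8, 1e9)    # under budget: no penalty
over = penalized_loss(0.5, 1.5e9, 1e9)   # 50% over budget: +0.5 penalty
```

In contrast to hard projection onto the feasible set, such a penalty lets early search iterates violate the budget while steering the optimization toward feasibility.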

Open challenges include improved proxies for non-pretrained settings, direct latency optimization, and generalization to multimodal or more complex task structures.

7. Significance and Emerging Directions

The PBT paradigm demonstrably advances the accuracy/efficiency envelope for compressed and compact neural networks by fusing pruning and architecture selection. Direct layer/channel/block selection within the training or NAS loop enables the discovery of non-uniform allocations that outperform uniform or magnitude-only counterparts. Recent momentum towards zero-shot and differentiable proxies (TraceNAS, DHP), NAS-based pruning for LLMs and foundation models (SuperSAM, LLaMA-NAS), and hardware/compiler-aligned search (NPAS) signals a convergence of model search, pruning, and deployment-aware optimization.

By framing pruning as a component of the architecture definition itself, PBT eliminates the dichotomy between network compression and design, yielding algorithms and pipelines that are more adaptive, efficient, and better matched to target resource budgets across a diversity of neural network workloads (Zheng et al., 2019, Li et al., 2020, Li et al., 2022, Malettira et al., 2 Feb 2026, Klein et al., 2024, Sarah et al., 2024, Abebe et al., 15 Jan 2025).
