Supermasks in Neural Networks
- Supermasks are binary and ternary masks applied element-wise to randomly initialized neural networks to reveal efficient subnetworks performing above chance without weight training.
- They are central to research in sparsification, efficient transfer learning, and continual learning, demonstrating that connectivity patterns can substitute for full weight optimization.
- Algorithmic approaches such as train–prune methods and edge-popup optimization effectively discover supermasks while achieving significant memory reduction and stable performance.
A supermask is a binary (or, in some extensions, ternary) mask applied element-wise to the weights of a randomly initialized, untrained neural network, creating a subnetwork capable of achieving above-chance—or even competitive—accuracy on target tasks without weight training. This structural phenomenon, closely related to the Lottery Ticket Hypothesis, has catalyzed advances in network sparsification, efficient transfer learning, continual learning, and highly compressive architectures. Supermasks reveal that combinatorial exploration of connectivity patterns at initialization suffices to uncover trainable or even directly usable subnetworks, thus challenging the primacy of weight optimization in deep learning.
1. Formal Definition and Construction of Supermasks
Given a neural network with weight tensors $W$ and a binary mask $m$ of the same shape with entries in $\{0, 1\}$, the supermask network is defined by the element-wise product $m \odot W$. The network $f(x;\, m \odot W_0)$, where $W_0$ is an untrained random initialization, is said to contain a supermask $m$ if it yields above-chance performance on the task without any further training of $W_0$ (Zhou et al., 2019).
Construction via trained criteria: A standard heuristic constructs $m$ by (i) training the initialization $W_0$ to final weights $W_f$; (ii) scoring each entry by $|w_f|$ or related metrics; (iii) thresholding to keep the top-$k\%$ of weights per layer, i.e., $m_{ij} = \mathbb{1}[\,|w_{f,ij}| \ge \tau\,]$, with $\tau$ chosen such that a target sparsity is met.
Empirically, “large final” ($|w_f|$) and “magnitude increase” ($|w_f| - |w_i|$) are effective criteria; random or inverted criteria yield inferior masks (Zhou et al., 2019).
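As a concrete illustration, the following sketch (assuming PyTorch; the layer shape and the stand-in for the trained weights are hypothetical) applies the “large final” criterion with per-layer top-$k$ thresholding:

```python
import torch

def large_final_mask(w_final: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude *final* weights of one layer."""
    scores = w_final.abs().flatten()
    k = max(1, int(keep_frac * scores.numel()))
    threshold = torch.topk(scores, k).values.min()        # per-layer threshold tau
    return (w_final.abs() >= threshold).float()

# Usage: given initial weights w_init and trained weights w_final for a layer,
# the supermask subnetwork evaluates w_init * mask with no further weight training.
w_init = torch.randn(256, 784)
w_final = w_init + 0.1 * torch.randn_like(w_init)         # stand-in for trained weights
mask = large_final_mask(w_final, keep_frac=0.2)
masked_weights = w_init * mask
```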
Direct mask learning: Alternatively, one can fix all weights at their random initialization $W_0$ and introduce trainable real-valued score matrices $S$, passing them through a (possibly stochastic) thresholding function to produce $m$; $S$ is then optimized to improve performance, with gradients bypassing the non-differentiable thresholding via the straight-through estimator (“edge-popup algorithm”) (García-Arias et al., 2021, Koster et al., 2022).
In the signed supermask extension, $m$ is ternary, with entries in $\{-1, 0, +1\}$, and the mask function assigns sign flips where advantageous (Koster et al., 2022).
2. Theoretical Principles and Relation to Lottery Ticket Hypothesis
The supermask paradigm is rooted in the insight that random neural networks are highly overparameterized, so sparse subnetworks with significant expressive capacity exist without weight optimization.
Sign-dominance and implicit training: The sign pattern of surviving weights—rather than their precise magnitudes—determines the “attraction basin” of the solution. Experimental ablations show that “mask-1 actions” preserving signs yield full performance, whereas arbitrary re-sampling or permutations of magnitudes degrade accuracy (Zhou et al., 2019). Conversely, zeroing pruned weights (mask-0) reflects the natural endpoint of their training trajectory, yielding better subnetworks than freezing them at initial random values—a mask thus mimics a one-step training update (Zhou et al., 2019).
Beyond training: supermask vs. lottery ticket: The canonical Lottery Ticket Hypothesis describes the existence of a trainable subnetwork at initialization. Supermasks demonstrate that, for certain tasks and mask selection criteria, identifying the mask alone suffices to realize above-random performance, eliminating the need for any weight training (Zhou et al., 2019). “Signed” supermasks further extend this to permit sign inversion of weights, doubling the available expressivity per surviving connection under high sparsity (Koster et al., 2022).
3. Algorithmic Approaches for Supermask Discovery
The two main procedural paradigms for supermask identification are:
Table 1: Principal Mask Discovery Algorithms
| Approach | Learned Parameters | Description |
|---|---|---|
| Train–prune "large final" | $W$: weights | Train $W$ to $W_f$; score by final magnitude, threshold to top-$k\%$ |
| Mask optimization | $S$: scores | Train $S$ on frozen $W_0$ (edge-popup/STE) |
| Signed supermask | $S$: scores (ternary mask) | Train $S$; allow mask values in $\{-1, 0, +1\}$ |
Edge-popup (score-based) optimization: Initialize $W_0$ (frozen) and scores $S$, select the top-$k\%$ of $S$ per layer to form $m$, forward-propagate with $m \odot W_0$, and update $S$ using gradient descent with backpropagation “straight through” the masking operator. This process is used for both classical supermasks and signed/ternary variants (García-Arias et al., 2021, Koster et al., 2022).
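A hedged PyTorch sketch of this score-based optimization (layer sizes, the score-magnitude selection rule, and initialization scales are illustrative choices, not the exact recipe of the cited papers):

```python
import torch
import torch.nn as nn

class TopKMaskSTE(torch.autograd.Function):
    """Top-k% binary mask with a straight-through gradient to the scores."""
    @staticmethod
    def forward(ctx, scores, keep_frac):
        k = max(1, int(keep_frac * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat thresholding as identity w.r.t. scores.
        return grad_output, None

class SupermaskLinear(nn.Module):
    def __init__(self, in_features, out_features, keep_frac=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5,
                                   requires_grad=False)       # frozen random initialization
        self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.keep_frac = keep_frac

    def forward(self, x):
        mask = TopKMaskSTE.apply(self.scores.abs(), self.keep_frac)
        return x @ (self.weight * mask).t()                   # forward with the masked frozen weights

layer = SupermaskLinear(784, 10)
optimizer = torch.optim.SGD([layer.scores], lr=0.1)           # only scores are trained
```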
Dynamic scoring and rescaling: In the stochastic mask variant, $m$ is sampled per pass from $\mathrm{Bernoulli}(\sigma(S))$, and the kept weights are renormalized according to the number of active entries, reducing variance in signal propagation (Zhou et al., 2019).
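A minimal sketch of this variant (the specific rescaling factor shown is one plausible implementation of “renormalized according to the number of active entries”, not necessarily the paper’s exact rule):

```python
import torch

def stochastic_supermask_forward(x, weight, scores):
    probs = torch.sigmoid(scores)
    mask = torch.bernoulli(probs)                    # fresh Bernoulli sample every pass
    kept = mask.sum().clamp(min=1.0)
    rescale = mask.numel() / kept                    # compensate for dropped entries
    return x @ (weight * mask * rescale).t()

x, weight, scores = torch.randn(32, 784), torch.randn(256, 784), torch.zeros(256, 784)
out = stochastic_supermask_forward(x, weight, scores)
```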
Multicoated masks: In graph neural networks (GNNs), scalar pruning masks can be “stacked” at multiple thresholds, forming multicoated supermasks that expand the search space for robust subnetworks. Thresholds are adaptively determined per-layer based on the score distribution to avoid ineffective masks (Yan et al., 2023).
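An illustrative sketch of the stacking idea (the quantile-based per-layer thresholds stand in for the adaptive rule of Yan et al. (2023); the keep fractions are arbitrary):

```python
import torch

def multicoat_mask(scores: torch.Tensor, keep_fracs=(0.3, 0.1, 0.03)) -> torch.Tensor:
    """Sum of several top-k binary masks: weights surviving more 'coats' get larger gates."""
    mask = torch.zeros_like(scores)
    for frac in keep_fracs:
        q = torch.quantile(scores.abs().flatten(), 1.0 - frac)
        mask = mask + (scores.abs() >= q).float()    # add one coat per threshold
    return mask                                      # integer-valued mask in {0, ..., len(keep_fracs)}

coated = multicoat_mask(torch.randn(128, 128))
```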
4. Extensions: Ternary Masks, Folding, and Continual Learning
Signed/ternary masks: Allowing $m_{ij} \in \{-1, 0, +1\}$ enables the mask to flip signs where the randomly initialized weight’s sign is suboptimal. This is especially beneficial at extreme sparsity, enabling “rescues” of connections that would otherwise be suppressed (Koster et al., 2022). The signed supermask is constructed via score thresholding into positive, zero, or negative assignments; after training, the network uses $m \odot W_0$ at inference.
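A toy sketch of such a ternary assignment (the symmetric magnitude threshold used here is a simplification of the scheme in Koster et al. (2022)):

```python
import torch

def signed_supermask(scores: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Ternary mask: prune low-magnitude scores to 0, keep the sign of the rest."""
    k = max(1, int(keep_frac * scores.numel()))
    threshold = torch.topk(scores.abs().flatten(), k).values.min()
    mask = torch.zeros_like(scores)
    keep = scores.abs() >= threshold
    mask[keep] = torch.sign(scores[keep])            # surviving entries in {-1, +1}
    return mask                                      # ternary mask in {-1, 0, +1}

w_init = torch.randn(128, 256)
m = signed_supermask(torch.randn_like(w_init), keep_frac=0.05)
effective_weights = m * w_init                       # used at inference
```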
Folded and recurrent architectures: In deep residual networks, stages can be folded (“hidden-folded”) into recurrent blocks, reusing weights and masks across iterations with unshared BatchNorm parameters. Only one binary mask per stage is learned, applied at each recurrent step; this approach, combined with supermask selection, substantially compresses memory usage with minimal accuracy degradation on ImageNet and CIFAR-100 (García-Arias et al., 2021). Analogous principles apply to deep GNNs, where folding plus adaptive, multicoated masks achieve large memory reductions (Yan et al., 2023).
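A hedged sketch of the folding pattern (channel counts, iteration count, and the residual wiring are illustrative; in practice the shared mask would be found by edge-popup-style optimization rather than fixed to ones):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoldedStage(nn.Module):
    """One conv weight tensor and one mask reused for several recurrent steps,
    with a separate BatchNorm per step (weights/mask shared, statistics not)."""
    def __init__(self, channels, iterations=4):
        super().__init__()
        self.weight = nn.Parameter(0.05 * torch.randn(channels, channels, 3, 3),
                                   requires_grad=False)           # frozen random init
        self.mask = nn.Parameter(torch.ones_like(self.weight), requires_grad=False)
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(iterations))

    def forward(self, x):
        for bn in self.bns:                                       # same masked weights each step
            x = x + F.relu(bn(F.conv2d(x, self.weight * self.mask, padding=1)))
        return x

out = FoldedStage(channels=64)(torch.randn(2, 64, 32, 32))
```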
Continual learning by mask superposition: The SupSup and ImpressLearn frameworks leverage supermasks for lifelong learning. Each task is allocated a dedicated mask, permitting the base network to be reused. ImpressLearn further reduces memory by learning a sparse linear combination of previously discovered masks (“impressions”), replacing per-task mask costs with small vectors of combination coefficients, delivering substantial reductions in per-task memory on challenging benchmarks (Bhardwaj et al., 2022, Wortsman et al., 2020).
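A minimal sketch of the coefficient-only idea (the sigmoid gating and the stand-in basis masks are illustrative assumptions, not the exact ImpressLearn recipe):

```python
import torch

# Stand-in basis masks ("impressions") discovered on earlier tasks, one per task.
basis_masks = torch.stack([torch.randint(0, 2, (256, 784)).float() for _ in range(10)])
coeffs = torch.zeros(len(basis_masks), requires_grad=True)   # the only per-task parameters

def task_mask(coeffs, basis_masks):
    # Linear combination of stored impressions, squashed into a soft gate.
    return torch.sigmoid((coeffs[:, None, None] * basis_masks).sum(0))

w_frozen = torch.randn(256, 784)                             # shared frozen backbone layer

def forward(x):
    return x @ (w_frozen * task_mask(coeffs, basis_masks)).t()

# Training a new task updates only `coeffs` (10 floats here) with the usual task loss.
optimizer = torch.optim.Adam([coeffs], lr=1e-2)
```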
5. Retrieval, Task Inference, and Memory Compression
Task retrieval with known and unknown identity: For known tasks, the correct mask index suffices to activate the corresponding subnetwork. When task identity is unknown, mask selection can be reformulated as an optimization problem: find the convex combination (belief vector $\alpha$) of all previously learned masks that minimizes output entropy. A single gradient/Frank–Wolfe step suffices in practice, even among thousands of masks (Wortsman et al., 2020).
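A simplified sketch of this entropy-minimization step (a single gradient update on a uniform belief vector; the shapes, the softmax parameterization of $\alpha$, and the linear subnetwork are stand-ins rather than the exact SupSup procedure):

```python
import torch
import torch.nn.functional as F

def infer_task(x, w_frozen, task_masks, steps=1, lr=1.0):
    alpha = torch.full((len(task_masks),), 1.0 / len(task_masks), requires_grad=True)
    masks = torch.stack(task_masks)                       # [num_tasks, out, in]
    for _ in range(steps):
        mixed = (torch.softmax(alpha, 0)[:, None, None] * masks).sum(0)
        logits = x @ (w_frozen * mixed).t()
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        grad, = torch.autograd.grad(entropy, alpha)
        alpha = (alpha - lr * grad).detach().requires_grad_(True)
    return torch.argmax(alpha).item()                     # most likely task index

# Usage with stand-in shapes: five stored masks for a 784 -> 10 linear subnetwork.
w = torch.randn(10, 784)
masks = [torch.randint(0, 2, (10, 784)).float() for _ in range(5)]
task = infer_task(torch.randn(32, 784), w, masks)
```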
Novelty detection and unlimited task capacity: If new data fails to be confidently classified by any existing supermask (measured via near-uniform softmax over mask posteriors), a new mask can be allocated for the new task. This enables fully unsupervised, task-agnostic lifelong learning (Wortsman et al., 2020).
Hopfield-style mask storage: To address the linear memory cost of storing thousands of masks, the set of task masks can be compressed as “attractors” in a fixed-size Hopfield network (i.e., storing bipolar representations of the masks as attractors in a single associative weight matrix). Retrieval is then performed by joint minimization of Hopfield energy and task entropy; a few gradient steps suffice to recover the correct mask (Wortsman et al., 2020).
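A toy sketch of the storage and retrieval mechanics (classical Hebbian outer-product storage with sign-update retrieval, omitting the joint entropy term; all shapes and noise levels are illustrative):

```python
import torch

def store(patterns):                         # patterns: [num_masks, dim], entries in {-1, +1}
    P = patterns.float()
    H = P.t() @ P / P.shape[1]               # Hebbian outer-product associative matrix
    H.fill_diagonal_(0.0)
    return H

def retrieve(H, query, steps=10):            # query: corrupted bipolar mask, [dim]
    x = query.clone().float()
    for _ in range(steps):
        x = torch.sign(H @ x)
        x[x == 0] = 1.0                      # break ties deterministically
    return x

dim, num_masks = 512, 4
masks = torch.randint(0, 2, (num_masks, dim)).float() * 2 - 1        # bipolar masks
H = store(masks)
flip = torch.ones(dim)
flip[torch.rand(dim) < 0.1] = -1.0                                    # corrupt 10% of entries
recovered = retrieve(H, masks[2] * flip)
print((recovered == masks[2]).float().mean())                         # fraction of bits recovered
```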
6. Experimental Performance and Empirical Insights
Accuracy and sparsity trade-offs: Experiments reveal that:
- Classical supermasks (binary) on MNIST FC: 86%–96% test accuracy at high sparsity, approaching dense-network performance (Zhou et al., 2019).
- Signed supermasks on CIFAR-10 with Conv8: 80.9% accuracy at only 1.2% keep rate, outperforming prior binary-mask supermasks and dense baselines (Koster et al., 2022).
- Folded hidden-fold networks: substantial memory reduction with minimal (<1% absolute) loss in accuracy on CIFAR-100/ImageNet (García-Arias et al., 2021).
- Multicoated masks in deep GNNs: significant memory reduction with baseline-matching accuracy, even for a 28-layer ResGCN+ (Yan et al., 2023).
- SupSup and ImpressLearn: on SplitCIFAR-100, ImpressLearn achieves 71.2% accuracy (ResNet-18) with only 210 floats overhead per task, compared to SupSup’s 1.5MB per mask (Bhardwaj et al., 2022).
Mask stability and interpretability: In tasks such as MNIST, signed supermask patterns are highly stable across runs—zero entries are near-deterministic in highly pruned layers, and inverted signs highlight key discriminative features (Koster et al., 2022).
Sign inversion and expressive capacity: The ability of ternary masking to “rescue” and flip connections is essential at extreme sparsity, providing a combinatorial expansion of the accessible function space (Koster et al., 2022).
7. Impact, Limitations, and Research Directions
Supermasks fundamentally alter the approach to neural network sparsification, task adaptation, and resource-constrained deployment. By demonstrating that mask discovery alone—often via straightforward, parallelizable score-based optimization—can rival or replace full weight training in certain regimes, they motivate a re-examination of optimization, transfer, and the role of parameter counts in deep learning.
Potential limitations include the reliance on highly overparameterized backbones, the task-specific selection of sparsity thresholds, sensitivity to initialization scaling (especially in signed/ternary schemes), and the potential for suboptimal performance on tasks with less alignment to random initializations. Adaptive thresholding strategies (as in multicoated masks) and architectural folding offer solutions for broader applicability (Yan et al., 2023, García-Arias et al., 2021).
Ongoing research investigates extensions to structured pruning, dynamic mask reallocation, improved continual learning via mask superposition, and the interpretability of combinatorial mask landscapes, as well as hardware acceleration for mask-only inference and ultra-lightweight model deployment regimes (Wortsman et al., 2020, Bhardwaj et al., 2022).
Supermasks thus represent both a practical framework for achieving state-of-the-art efficiency and an analytic probe into the structure and capacity of untrained, random neural networks.