Stack-Based Pruning Algorithm
- A stack-based pruning algorithm trims candidate decoding paths in polar/PAC codes and prunes redundant layers in deep models.
- It employs statistically motivated thresholds (e.g., Chernoff bounds) and redundancy measures (e.g., cosine similarity) to retain only high-quality candidates.
- This approach achieves significant complexity gains, with over 90% reduction in decoding paths and up to 45% pruning in transformer layers, while maintaining performance.
A stack-based pruning algorithm is a class of complexity-reduction techniques in sequential path-search decoders or deep learning architectures, where a stack or ordered heap of candidate hypotheses is dynamically pruned using per-branch metrics or redundancy measures. In coding theory, these methods are deployed in the decoding of polar codes and PAC (Polarization-Adjusted Convolutional) codes to discard statistically unlikely or demonstrably redundant decoding paths, while in neural networks the principle extends to pruning redundant layers or computational sub-blocks. Central to modern stack-based pruning is the definition of rigorous pruning metrics, statistically motivated thresholds, and guarantees that the optimal solution is almost never pruned.
1. Theoretical Foundations of Stack-Based Pruning
In sequential decoding of polar or PAC codes, a stack decoder explores a tree of possible codewords by expanding candidate paths ranked by a path metric. Each partial path $u^i$ at depth $i$ is scored by a path metric
$$\Gamma(u^i) = \sum_{j=1}^{i} \gamma_j,$$
where $\gamma_j$ is a local bit-metric, derived (in Fano-metric form) as
$$\gamma_j = \log_2 \frac{P(y \mid u^j)}{P(y \mid u^{j-1})} - b_j,$$
with $y$ the channel output and $b_j$ a per-bit bias.
This metric distinguishes between paths aligned with the true codeword (the “correct” path) and incorrect paths by comparing the metric value to the channel’s cutoff rate or the symmetric capacity (Moradi et al., 2022). For deep transformer models, a stack of layers is assessed for redundancy by similarity metrics and pruned by removal or replacement as entire computational blocks (Dorszewski et al., 2024).
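The Fano-type accumulation above can be sketched in a few lines. This is a minimal illustration, not the exact metric of the cited papers (which use polarized bit-channel likelihoods); the function names and argument conventions are assumptions for the example.

```python
import math

def bit_metric(p_new, p_old, bias):
    """Fano-type bit-metric: log-likelihood gain from extending the path
    by one bit (ratio of path likelihoods), minus a per-bit bias."""
    return math.log2(p_new / p_old) - bias

def path_metric(path_probs, prefix_probs, biases):
    """Path metric: sum of per-bit metrics along a partial path."""
    return sum(bit_metric(p, q, b)
               for p, q, b in zip(path_probs, prefix_probs, biases))
```

On the correct path each likelihood ratio tends to exceed $2^{b_j}$, so the metric drifts upward; on wrong paths it drifts downward, which is what makes threshold pruning effective.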
2. Pruning Metrics and Threshold Criteria
Stack-based pruning algorithms rely on tightly controlled thresholding of path metrics to safely reduce the search space. In polar/PAC decoding, the unnormalized bit-metric for branch $j$ is
$$\gamma_j = \log_2 \frac{P(y \mid u^j)}{P(y \mid u^{j-1})} - b_j,$$
with positive expected value $\mu_j$ on the correct path and strictly negative expected value on wrong branches. The algorithm prunes any partial path whose bit-metric $\gamma_j$ falls below a threshold $t_j$, capitalizing on the exponential gap between correct-path and incorrect-path metrics. Statistical bounds such as one-sided Chernoff or Chebyshev inequalities provide explicit exponential decay for the probability of mistakenly pruning the correct path:
$$\Pr\big[\gamma_j < t_j \mid \text{correct path}\big] \;\le\; \exp\!\left(-\frac{(\mu_j - t_j)^2}{2\sigma_j^2}\right),$$
where $\mu_j$ and $\sigma_j^2$ denote the mean and variance of the correct-path bit-metric (Moradi et al., 2022, Moradi et al., 8 Sep 2025).
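Given per-bit means and variances, such tail bounds can be inverted to set thresholds that meet a target pruning-error probability. A sketch under the stated Chebyshev and sub-Gaussian (Chernoff-style) tails; the function names are illustrative:

```python
import math

def chebyshev_threshold(mu, sigma, eps):
    """Pr[gamma < mu - delta] <= sigma^2 / delta^2 = eps
    =>  delta = sigma / sqrt(eps)."""
    return mu - sigma / math.sqrt(eps)

def chernoff_threshold(mu, sigma, eps):
    """Sub-Gaussian tail: Pr[gamma < mu - delta] <= exp(-delta^2 / (2 sigma^2)) = eps
    =>  delta = sigma * sqrt(2 ln(1/eps))."""
    return mu - sigma * math.sqrt(2.0 * math.log(1.0 / eps))
```

For small `eps` the Chernoff-style threshold sits much closer to the mean than the Chebyshev one, i.e., it prunes more aggressively for the same guaranteed error probability.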
For deep learning stacks, redundancy across layers is measured by cosine similarity, linear centered kernel alignment (CKA), or mutual nearest-neighbor (kNN) alignment, identifying blocks of highly similar or functionally redundant layers. Layers within such blocks are prime targets for pruning (Dorszewski et al., 2024).
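Of the redundancy measures above, linear CKA is easily computed from two layers' activation matrices. A minimal sketch (the function name and shapes are assumptions, not from the cited paper):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices
    of shape (n_samples, n_features). Values near 1 indicate that the
    two layers produce highly redundant representations."""
    X = X - X.mean(axis=0)                    # center features
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

CKA is invariant to orthogonal transformations and isotropic scaling of the activations, which is why it is preferred over raw cosine similarity when comparing layers with different bases.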
3. Algorithmic Structures and Pseudocode
In polar/PAC decoding, the stack-based pruning algorithm proceeds as follows:
- Initialize the stack with the empty path.
- Repeatedly pop the top-ranked candidate.
- For each successor:
- Compute the local bit-metric.
- Prune (discard) the candidate unless the metric exceeds the adaptive per-bit threshold.
- Update metric and push surviving candidates into the stack.
- Re-sort the stack and truncate to the permitted maximum size.
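The loop above can be sketched with a binary heap in place of an explicitly re-sorted stack. All names (`expand`, `threshold`) are illustrative, not from the cited papers:

```python
import heapq

def stack_decode(expand, threshold, depth_max, stack_max):
    """Sketch of a stack decoder with per-bit threshold pruning.
    `expand(path)` yields (bit, bit_metric) successors of a partial path;
    `threshold(depth)` is the adaptive per-bit pruning threshold."""
    # heapq is a min-heap, so store negated metrics to pop the best path first.
    stack = [(0.0, ())]
    while stack:
        neg_metric, path = heapq.heappop(stack)
        if len(path) == depth_max:
            return path, -neg_metric          # best complete path found
        for bit, gamma in expand(path):
            if gamma < threshold(len(path)):  # prune statistically unlikely branch
                continue
            heapq.heappush(stack, (neg_metric - gamma, path + (bit,)))
        if len(stack) > stack_max:            # truncate to permitted maximum size
            stack = heapq.nsmallest(stack_max, stack)
            heapq.heapify(stack)
    return None, float("-inf")                # every path was pruned
```

Because surviving candidates are pushed with their cumulative metric, popping the heap minimum (most negative negated metric) always expands the current best-ranked path, matching the pop/expand/prune/push cycle described above.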
A specialization for fast-decodable node types (e.g., rate-0, REP, type-IV, and rate-1 nodes) further reduces complexity by admitting into the stack only those candidate paths whose aggregate bit-metrics exceed a threshold derived from mean/variance approximations and the desired pruning-error probability (Moradi et al., 8 Sep 2025).
For transformer models, pruning proceeds by computing similarity matrices across all layers, identifying block structures, and iteratively deleting the lowest-influence layers using either block-influence heuristics or explicit kNN-similarity reduction until a minimum target accuracy is reached. All steps may be performed post hoc, without retraining (Dorszewski et al., 2024).
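A greedy variant of this layer-pruning loop can be sketched as follows. This is an illustrative simplification, not the exact procedure of Dorszewski et al.; `similarity` and `evaluate` are assumed callables supplied by the user:

```python
def prune_layers(layers, similarity, evaluate, min_accuracy):
    """Greedy post-hoc layer pruning: repeatedly drop the layer most
    similar to its predecessor, stopping before accuracy would fall
    below the target. No retraining is involved."""
    layers = list(layers)
    while len(layers) > 1:
        # index of the layer most redundant with the layer before it
        idx = max(range(1, len(layers)),
                  key=lambda i: similarity(layers[i - 1], layers[i]))
        candidate = layers[:idx] + layers[idx + 1:]
        if evaluate(candidate) < min_accuracy:
            break                 # next removal would cross the accuracy floor
        layers = candidate
    return layers
```

In practice `similarity` would be a measure such as linear CKA over held-out activations and `evaluate` a downstream-task accuracy check.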
4. Statistical Guarantees and Safe Pruning Regimes
Chernoff bound analysis in the coding context provides a quantifiable guarantee that the probability of pruning the correct path falls exponentially with the gap $\mu_j - t_j$ between the correct-path mean bit-metric and the threshold. Total pruning failure probability over all $N$ stages is then bounded by the union bound
$$P_{\text{prune}} \;\le\; \sum_{j=1}^{N} \Pr\big[\gamma_j < t_j \mid \text{correct path}\big],$$
which remains negligible provided thresholds are set below the cutoff rates (Moradi et al., 2022). In the variance-guided pruning regime introduced for PAC codes, per-bit thresholds are established such that the probability of pruning the correct path is bounded by either a Chebyshev or a Chernoff tail, so the failure probability can be made arbitrarily small while retaining only candidates whose bit-metrics remain statistically close to the expected mean (Moradi et al., 8 Sep 2025).
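The union bound suggests a simple recipe for per-bit thresholds: split the total failure budget evenly across stages and invert the tail at each stage. A sketch of this variance-guided style of threshold setting, using an assumed sub-Gaussian tail; the function name is illustrative:

```python
import math

def per_bit_thresholds(means, variances, total_eps):
    """Set per-bit thresholds so the union bound over all N stages keeps
    the total correct-path pruning probability below total_eps."""
    n = len(means)
    eps = total_eps / n                       # equal budget per stage
    # invert exp(-delta^2 / (2 var)) = eps  =>  delta = sqrt(2 var ln(1/eps))
    return [mu - math.sqrt(2.0 * var * math.log(1.0 / eps))
            for mu, var in zip(means, variances)]
```

Tightening `total_eps` pushes every threshold further below its mean, i.e., safer but less aggressive pruning.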
In transformer model pruning, catastrophic degradation is only observed when an entire block is removed, affirming that at least some representation from each detected functional block must be preserved to retain performance. Pruning within blocks yields virtually no drop in predictive capacity due to within-block redundancy (Dorszewski et al., 2024).
5. Practical Complexity Gains
Substantial reductions in practical computational complexity have been documented. For a PAC(128,64) code, threshold-based pruning reduced the average stack occupancy by over 90% relative to conventional stack decoding, with no compromise in frame-error rate (FER) across a range of SNRs (Moradi et al., 2022). The variance-guided stack decoder achieved a further substantial reduction in the average number of surviving paths with zero performance loss, while also reducing heap operations and LLR evaluations (Moradi et al., 8 Sep 2025).
In transformer-based speech models, up to 45% of layers can be pruned post-training with the original accuracy maintained. Further replacement of the entire stack with mimicking layers yielded substantial parameter reduction and inference speedup while preserving nearly all predictive capability on downstream tasks (Dorszewski et al., 2024).
6. Design Assumptions, Algorithmic Limitations, and Parameter Selection
Stack-based pruning assumes an infinite or sufficiently large stack to avoid losing the correct path through overflow. Metric thresholding presumes offline computation and storage of channel parameters such as the cutoff rate and the per-bit metric means and variances. In short block-length or very low-SNR regimes, thresholds must be set conservatively to avoid performance loss, as the gap between the cutoff rate and capacity may be substantial (Moradi et al., 2022). For variance-guided pruning, Gaussian approximations of polarized bit-channel means and variances enable fast, per-frame threshold computation, ensuring per-path adaptation (Moradi et al., 8 Sep 2025).
For deep model pruning, no retraining is required for layer removal unless entire functionally distinct blocks are removed. However, for best results, especially in stack replacement by mimicking networks, a two-stage knowledge distillation or matching loss optimization is advisable (Dorszewski et al., 2024).
7. Broader Implications and Interconnections
Stack-based pruning strategies highlight the universality of metric-concentration phenomena in both information-theoretic coding and neural network architectures. In sequential decoding, metric-driven pruning enables orders-of-magnitude complexity savings while guaranteeing exponentially small error. In deep learning, stack redundancy analysis and judicious removal or replacement of layers can yield radically leaner architectures. Across domains, the underlying principle is the exploitation of statistical predictability and redundancy to preserve optimal solutions while discarding vast numbers of provably suboptimal candidates (Moradi et al., 2022, Moradi et al., 8 Sep 2025, Dorszewski et al., 2024).