Pruned and Orthogonal Subnetworks
- Pruned and orthogonal subnetworks are architectures that combine structured sparsification with enforced weight or activation orthogonality to eliminate redundancy.
- Techniques such as OrthoReg, OrthCaps, and SNP enforce orthogonality through regularization, Householder transformations, and LDL projections, enabling accurate pruning decisions.
- Empirical results demonstrate that enforcing orthogonality leads to efficient inference, robust generalization, and significant reductions in parameters and FLOPs.
Pruned and orthogonal subnetworks are a family of neural network architectures or subnet extraction methodologies characterized by two properties: (1) structured sparsification via the removal of parameters, nodes, or computational primitives—yielding a subnetwork; and (2) explicit enforcement or construction of weight or activation orthogonality within the retained subspace. Such approaches have emerged to address redundancy, parameter inefficiency, and optimization pathologies in overparameterized models, enabling efficient inference, robust generalization, and accurate importance estimation for pruning decisions. Techniques in this family are applied across convolutional, capsule, and deep feedforward networks, relying on algorithmic innovations in orthogonal regularization, orthogonal projection, and data-driven subspace analysis.
1. Theoretical Motivation for Orthonormal Pruning
Traditional pruning strategies rely on the additive estimation of group importance under the assumption of independence among filters, weights, or units. However, empirical observations in modern, overparameterized networks reveal substantial inter-parameter correlation, violating this independence and rendering additive importance estimates unreliable. For example, in convolutional neural networks (CNNs), correlated filters cause cross-terms in loss approximations that bias pruning criteria, leading to suboptimal parameter removal and impaired retraining dynamics (Lubana et al., 2020). Orthogonality reduces or eliminates these correlations, restoring additivity and ensuring that group-wise pruning accurately reflects the true contribution to network loss or prediction.
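To make this bias explicit, consider a schematic second-order expansion of the loss change when two parameter groups are perturbed jointly (a generic expansion for intuition, not the exact formulation of any single cited paper):

$$\Delta\mathcal{L} \;\approx\; \Big(g_i^\top \delta_i + \tfrac{1}{2}\delta_i^\top H_{ii}\,\delta_i\Big) + \Big(g_j^\top \delta_j + \tfrac{1}{2}\delta_j^\top H_{jj}\,\delta_j\Big) + \delta_i^\top H_{ij}\,\delta_j,$$

where $\delta_i, \delta_j$ are the weight perturbations induced by removing groups $i$ and $j$, and $g$, $H$ denote the gradient and Hessian. Additive importance estimates account only for the first two bracketed terms; the cross-term $\delta_i^\top H_{ij}\delta_j$ vanishes when the groups are decorrelated, which is precisely what orthogonality encourages.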
In the loss landscape context, the “flat minimum valley” phenomenon, as explored in structured directional pruning (Li et al., 2021), motivates pruning along orthogonal directions tangent to the valley to avoid deviating from regions of low loss. Orthogonal projections ensure that post-pruning weights remain in or near the original optimizer’s basin, preserving trainability and eliminating the need for extended retraining.
2. Methods for Imposing Orthogonality
Orthogonality can be achieved via explicit regularization during training, via orthogonal parameterization of layer transformations, or by construction through orthogonal projections in the post-hoc analysis of activations or weights. Minimal code sketches of each mechanism follow the list below.
- Penalization-based regularization: OrthoReg imposes a Frobenius-norm penalty on the deviation of intra-layer weight matrices from orthonormality. For a layer with weight matrix $W$, the orthonormality penalty is $R(W) = \lVert W W^\top - I \rVert_F^2$.
The total loss is then $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \sum_l R(W_l)$ (Lubana et al., 2020).
- Strict orthogonal parameterization: In capsule networks (CapsNets), OrthCaps parameterizes the projection matrices in sparse attention routing as a product of Householder reflections, $W = H_1 H_2 \cdots H_k$ with each $H_i = I - 2\, v_i v_i^\top / \lVert v_i \rVert_2^2$,
which guarantees $W^\top W = I$ exactly (Geng et al., 2024).
- Gram–Schmidt/LDL-based orthogonalization: In subspace node pruning (SNP), for a layer with activation matrix $X$, an orthogonal subspace is constructed by applying an LDL decomposition to the Gram matrix ($X X^\top = L D L^\top$) and using the projection $\tilde{X} = L^{-1} X$. The projected activity $\tilde{X}$ has orthogonal rows (since $\tilde{X}\tilde{X}^\top = D$), and pruning is executed in this basis (Offergeld et al., 2024).
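A minimal NumPy sketch of the Frobenius orthonormality penalty described above; the variable names and the $\lambda$ weighting are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def orthonormality_penalty(W: np.ndarray) -> float:
    """Frobenius-norm deviation of the rows of W from orthonormality."""
    gram = W @ W.T                                  # inter-filter Gram matrix
    eye = np.eye(gram.shape[0])
    return float(np.linalg.norm(gram - eye, ord="fro") ** 2)

def regularized_loss(task_loss: float, weights: list[np.ndarray], lam: float = 1e-3) -> float:
    """Total objective = task loss + lambda * sum of per-layer orthonormality penalties."""
    return task_loss + lam * sum(orthonormality_penalty(W) for W in weights)
```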
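A sketch of building an orthogonal matrix as a product of Householder reflections, in the spirit of the OrthCaps parameterization; the number of reflections and the vector initialization are assumptions for illustration:

```python
import numpy as np

def householder(v: np.ndarray) -> np.ndarray:
    """Householder reflection H = I - 2 v v^T / ||v||^2 (orthogonal by construction)."""
    v = v.reshape(-1, 1)
    return np.eye(v.shape[0]) - 2.0 * (v @ v.T) / float(v.T @ v)

def householder_product(vectors: list[np.ndarray]) -> np.ndarray:
    """Product of Householder reflections; a product of orthogonal matrices stays orthogonal."""
    W = np.eye(vectors[0].shape[0])
    for v in vectors:
        W = W @ householder(v)
    return W

rng = np.random.default_rng(0)
vs = [rng.standard_normal(8) for _ in range(4)]        # 4 reflections in R^8
W = householder_product(vs)
assert np.allclose(W.T @ W, np.eye(8), atol=1e-10)     # W^T W = I up to float error
```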
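A sketch of the LDL-based orthogonalization of activations used for subspace node pruning. Here the LDL factors are obtained via a Cholesky factorization of the Gram matrix (an implementation shortcut, not necessarily how the authors compute it), and the small ridge term is an assumption for numerical stability:

```python
import numpy as np

def orthogonalize_activations(X: np.ndarray, eps: float = 1e-6):
    """X: (units, samples). Returns L and the projected activity L^{-1} X,
    whose rows are mutually orthogonal (the Gram matrix becomes diagonal)."""
    gram = X @ X.T + eps * np.eye(X.shape[0])    # ridge term for numerical stability
    C = np.linalg.cholesky(gram)                 # gram = C C^T, C lower-triangular
    d = np.diag(C).copy()
    L = C / d                                    # unit-diagonal L; gram = L D L^T with D = diag(d**2)
    X_orth = np.linalg.solve(L, X)               # projected activity L^{-1} X
    return L, X_orth

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 512))               # 16 units, 512 samples
L, X_orth = orthogonalize_activations(X)
G = X_orth @ X_orth.T
off_diag = G - np.diag(np.diag(G))
assert np.max(np.abs(off_diag)) < 1e-3           # rows are (numerically) orthogonal
```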
3. Pruning Subnetworks in Orthogonal Spaces
Pruning is made tractable and effective by leveraging orthogonality to decouple weights or activations; representative mechanisms include the following (code sketches for several of these steps follow the list):
- Group or filter pruning in orthonormal bases: Under OrthoReg, the first-order Taylor importance of a filter with parameters $\theta_f$ is $\mathcal{I}_f = \lvert \theta_f^\top \nabla_{\theta_f} \mathcal{L} \rvert$. With nearly orthogonal filters, group-wise importance sums are additive, $\mathcal{I}_G \approx \sum_{f \in G} \mathcal{I}_f$, enabling simultaneous large-group pruning with minimal bias (Lubana et al., 2020).
- Capsule pruning via cosine redundancy: OrthCaps prunes primary capsules using a cosine-similarity threshold: when two capsules' similarity exceeds the threshold, the less important one (as ranked by activation norm) is pruned. This reduces redundancy prior to each routing step (Geng et al., 2024).
- Orthogonal subspace projection for node ranking: In SNP, layer-wise activity is projected onto orthogonal rows; ordering is optimized (e.g., via ZCA-based heuristics) to maximize cumulative variance in the kept directions. A target variance or parameter budget determines how many top-variance orthogonal units are retained, with the impact of pruned units reconstructed via linear least squares (Offergeld et al., 2024).
- Directional pruning via orthogonal projection: Structured Directional Pruning (SDP) solves
$$\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) + \lambda \sum_{g} c_g \lVert \mathbf{w}_g \rVert_2,$$
where the $c_g$ are direction factors derived from projecting parameter groups onto the tangent space of the flat minimum valley. The AltSDP solver interleaves SGD-like updates with group shrinkage adjusted along these orthogonal directions (Li et al., 2021).
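A sketch of similarity-based capsule pruning in the spirit of OrthCaps; the threshold value and the greedy keep-first ordering are assumptions for illustration:

```python
import numpy as np

def prune_redundant_capsules(capsules: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """capsules: (num_capsules, dim), assumed sorted by importance (e.g., activation norm),
    most important first. Returns a boolean keep-mask: a capsule is dropped when its
    cosine similarity with an already-kept capsule exceeds the threshold."""
    norms = np.linalg.norm(capsules, axis=1, keepdims=True) + 1e-12
    unit = capsules / norms
    keep = np.ones(len(capsules), dtype=bool)
    for i in range(len(capsules)):
        if not keep[i]:
            continue
        sims = unit[i + 1:] @ unit[i]              # cosine similarity with the remaining capsules
        keep[i + 1:] &= ~(sims > threshold)        # drop the less important (later) duplicates
    return keep
```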
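A sketch of the variance-based selection and least-squares reconstruction described for SNP; the variance-budget interface and the pseudo-inverse reconstruction of the next layer's weights are illustrative assumptions:

```python
import numpy as np

def select_by_variance(X_orth: np.ndarray, target_variance: float = 0.95) -> np.ndarray:
    """Keep the top-variance orthogonal units whose cumulative variance reaches the target."""
    var = np.sum(X_orth ** 2, axis=1)              # per-unit energy in the orthogonal basis
    order = np.argsort(var)[::-1]
    cum = np.cumsum(var[order]) / np.sum(var)
    k = int(np.searchsorted(cum, target_variance)) + 1
    keep = np.zeros(X_orth.shape[0], dtype=bool)
    keep[order[:k]] = True
    return keep

def reconstruct_pruned(W_next: np.ndarray, X: np.ndarray, keep: np.ndarray) -> np.ndarray:
    """Absorb the pruned units' contribution into the next layer's weights by linear
    least squares: find W_kept such that W_kept @ X[keep] approximates W_next @ X."""
    target = W_next @ X                            # original pre-activations of the next layer
    sol, *_ = np.linalg.lstsq(X[keep].T, target.T, rcond=None)
    return sol.T                                   # new weights acting on the kept units only
```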
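A sketch of one alternating step in the AltSDP spirit: an SGD-like update followed by direction-weighted group soft-thresholding. The per-group direction factors and step sizes here are placeholders; the actual solver derives them from the loss-landscape geometry:

```python
import numpy as np

def group_soft_threshold(w_g: np.ndarray, thresh: float) -> np.ndarray:
    """Proximal operator of the group-l2 penalty: shrink the whole group toward zero."""
    norm = np.linalg.norm(w_g)
    if norm <= thresh:
        return np.zeros_like(w_g)
    return (1.0 - thresh / norm) * w_g

def altsdp_like_step(groups, grads, c, lr=0.1, lam=1e-3):
    """One alternating step: SGD-like update on the task loss, then group shrinkage
    weighted by per-group direction factors c[g] (larger c => stronger shrinkage)."""
    new_groups = []
    for w_g, g_g, c_g in zip(groups, grads, c):
        w_g = w_g - lr * g_g                               # gradient step
        w_g = group_soft_threshold(w_g, lr * lam * c_g)    # direction-weighted group shrinkage
        new_groups.append(w_g)
    return new_groups
```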
4. Algorithms and Workflows
Representative pipelines are summarized below:
| Approach | Orthogonalization Mechanism | Pruning Step |
|---|---|---|
| OrthoReg | Loss-augmented orthonormal penalty | Taylor importance, group-additive, iterative |
| OrthCaps | Exact Householder product | Cosine similarity threshold, Entmax routing |
| SNP | LDL (unnormalized GS) on activations | Variance-based orthogonal projection pruning |
| SDP/AltSDP | Hessian/projector via dynamics | Tangent-space group shrinkage |
- OrthoReg (Lubana et al., 2020): Pretrain → fine-tune with the orthonormality-regularized loss $\mathcal{L}_{\text{task}} + \lambda \sum_l R(W_l)$ → prune based on Taylor scores → fine-tune remaining rounds → retrain final network. Large early pruning (30–50%) is enabled by orthogonality, reducing rounds from ∼5 to ∼2.
- OrthCaps (Geng et al., 2024): Prune primary capsules based on norm and cosine similarity → replace dynamic routing with orthogonal, one-shot attention → enforce orthogonality on projection matrices via Householder decomposition → achieve strong accuracy with <2% of standard parameters.
- SNP (Offergeld et al., 2024): Project activations per-layer via optimized, order-sensitive LDL/GS → prune lowest-variance orthogonal units → reconstruct effects with linear regression; automatic proportioning via cumulative explained variance.
- SDP/AltSDP (Li et al., 2021): Alternate SGD with group shrinkage in directions tangent to flat minima; the design ensures weights are pruned along nearly loss-invariant axes, and the result is proven to lie in the same valley as the baseline optimizer's solution.
5. Empirical Results and Comparative Performance
Orthogonality-based pruning yields consistent improvements in both efficiency and accuracy across models and tasks:
- OrthoReg results (Lubana et al., 2020): For VGG-13, MobileNet-V1, ResNet-34 on CIFAR-100 and Tiny-ImageNet:
- 65% pruning (ResNet-34, CIFAR-100): 74.1% (OrthoReg) vs. 73.2% (Fisher)
- 60% pruning (ResNet-34, Tiny-ImageNet): 54.7% vs. 52.7%
- Early-bird ticket setting: OrthoReg finds subnetworks matching or exceeding baseline accuracy with single-shot pruning after a few epochs.
- OrthCaps (Geng et al., 2024): OrthCaps-Shallow (∼1.25% standard parameters) achieves 99.68% (MNIST), 86.84% (CIFAR10); OrthCaps-Deep (<1.5% ResNet-18 parameters) with 90.56% (CIFAR10). Ablations confirm both pruning and orthogonal routing are critical for best results.
- SNP (Offergeld et al., 2024): On ImageNet VGG-16, up to 50–60% parameter reduction with under 1% Top-1 drop, outperforming SAW and unstructured pruning; ResNet-50 with 30–40% FLOP reduction for <1% Top-1 loss.
- SDP/AltSDP (Li et al., 2021): Up to 55% FLOPs reduction on CIFAR-10/ResNet-56 with ≤0.1% accuracy drop; no retraining post-pruning required.
6. Architectural Generality and Variants
Pruned and orthogonal subnetworks span multiple architectures:
- CNNs: Most orthogonality-based methods have been demonstrated on convolutional layers; OrthoReg and SNP impose orthogonality on filters or activations, achieving layer-wise or network-wide pruning.
- Capsule Networks: OrthCaps demonstrates that redundancy and computational cost in dynamic routing can be controlled only when orthogonality is enforced end-to-end, not just after the initial layer.
- Deep MLPs/Feedforward: Directional pruning and subspace orthogonal projections are architecture-agnostic and can be applied at arbitrary layers with group-wise or node-wise sparsity.
7. Interpretability, Limitations, and Open Directions
Orthonormality and orthogonal projection clarify group importance estimates and make pruning heuristics more transparent to analysis. Empirical evidence supports the contention that orthogonal subnetworks preserve task-relevant signals and maintain “dynamical isometry” for gradient propagation. A notable limitation is computational overhead: explicit orthogonalization requires matrix factorizations or the maintenance of transformation matrices, though practical heuristics (batch-wise LDL, Householder products) mitigate this cost.
A plausible implication is that future directions will further integrate data-driven subspace discovery, orthogonality-promoting architectures, and loss landscape geometry in large-scale networks for robust sparsification and transferability. Optimal ordering in orthogonalization and more nuanced quantification of “variance explained” per subspace remain active areas.
For detailed algorithms, implementation guidelines, activation order heuristics, and empirical layer-wise ablations, see (Lubana et al., 2020) (OrthoReg), (Geng et al., 2024) (OrthCaps), (Li et al., 2021) (SDP/AltSDP), and (Offergeld et al., 2024) (SNP).