Pruned and Orthogonal Subnetworks
- Pruned and orthogonal subnetworks are architectures that combine structured sparsification with enforced weight or activation orthogonality to eliminate redundancy.
- Techniques such as OrthoReg, OrthCaps, and SNP enforce orthogonality through regularization, Householder transformations, and LDL projections, enabling accurate pruning decisions.
- Empirical results demonstrate that enforcing orthogonality leads to efficient inference, robust generalization, and significant reductions in parameters and FLOPs.
Pruned and orthogonal subnetworks are a family of neural network architectures or subnet extraction methodologies characterized by two properties: (1) structured sparsification via the removal of parameters, nodes, or computational primitives—yielding a subnetwork; and (2) explicit enforcement or construction of weight or activation orthogonality within the retained subspace. Such approaches have emerged to address redundancy, parameter inefficiency, and optimization pathologies in overparameterized models, enabling efficient inference, robust generalization, and accurate importance estimation for pruning decisions. Techniques in this family are applied across convolutional, capsule, and deep feedforward networks, relying on algorithmic innovations in orthogonal regularization, orthogonal projection, and data-driven subspace analysis.
1. Theoretical Motivation for Orthonormal Pruning
Traditional pruning strategies rely on the additive estimation of group importance under the assumption of independence among filters, weights, or units. However, empirical observations in modern, overparameterized networks reveal substantial inter-parameter correlation, violating this independence and rendering additive importance estimates unreliable. For example, in convolutional neural networks (CNNs), correlated filters cause cross-terms in loss approximations that bias pruning criteria, leading to suboptimal parameter removal and impaired retraining dynamics (Lubana et al., 2020). Orthogonality reduces or eliminates these correlations, restoring additivity and ensuring that group-wise pruning accurately reflects the true contribution to network loss or prediction.
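To make this bias explicit, consider a schematic second-order expansion of the loss change when two parameter groups are perturbed jointly (a generic expansion for intuition, not the exact formulation of any single cited paper):

$$\Delta\mathcal{L} \;\approx\; \Big(g_i^\top \delta_i + \tfrac{1}{2}\delta_i^\top H_{ii}\,\delta_i\Big) + \Big(g_j^\top \delta_j + \tfrac{1}{2}\delta_j^\top H_{jj}\,\delta_j\Big) + \delta_i^\top H_{ij}\,\delta_j,$$

where $\delta_i, \delta_j$ are the weight perturbations induced by removing groups $i$ and $j$, and $g$, $H$ denote the gradient and Hessian. Additive importance estimates account only for the first two bracketed terms; the cross-term $\delta_i^\top H_{ij}\delta_j$ vanishes when the groups are decorrelated, which is precisely what orthogonality encourages.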
In the loss landscape context, the “flat minimum valley” phenomenon, as explored in structured directional pruning (Li et al., 2021), motivates pruning along orthogonal directions tangent to the valley to avoid deviating from regions of low loss. Orthogonal projections ensure that post-pruning weights remain in or near the original optimizer’s basin, preserving trainability and eliminating the need for extended retraining.
2. Methods for Imposing Orthogonality
Orthogonality can be achieved via explicit regularization during training, via orthogonal parameterization of layer transformations, or by construction through orthogonal projections in the post-hoc analysis of activations or weights. Minimal code sketches of each mechanism follow the list below.
- Penalization-based regularization: OrthoReg imposes a Frobenius-norm penalty on the deviation of intra-layer weight matrices from orthonormality. For a layer with weight matrix $W$, the orthonormality penalty is $R(W) = \lVert W W^\top - I \rVert_F^2$.
The total loss is then $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \sum_l R(W_l)$ (Lubana et al., 2020).
- Strict orthogonal parameterization: In capsule networks (CapsNets), OrthCaps parameterizes the projection matrices in sparse attention routing as a product of Householder reflections, $W = H_1 H_2 \cdots H_k$ with each $H_i = I - 2\, v_i v_i^\top / \lVert v_i \rVert_2^2$,
which guarantees $W^\top W = I$ exactly (Geng et al., 2024).
- Gram–Schmidt/LDL-based orthogonalization: In subspace node pruning (SNP), for a layer with activation matrix $X$, an orthogonal subspace is constructed by applying an LDL decomposition to the Gram matrix ($X X^\top = L D L^\top$) and using the projection $\tilde{X} = L^{-1} X$. The projected activity $\tilde{X}$ has orthogonal rows (since $\tilde{X}\tilde{X}^\top = D$), and pruning is executed in this basis (Offergeld et al., 2024).
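A minimal NumPy sketch of the Frobenius orthonormality penalty described above; the variable names and the $\lambda$ weighting are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def orthonormality_penalty(W: np.ndarray) -> float:
    """Frobenius-norm deviation of the rows of W from orthonormality."""
    gram = W @ W.T                                  # inter-filter Gram matrix
    eye = np.eye(gram.shape[0])
    return float(np.linalg.norm(gram - eye, ord="fro") ** 2)

def regularized_loss(task_loss: float, weights: list[np.ndarray], lam: float = 1e-3) -> float:
    """Total objective = task loss + lambda * sum of per-layer orthonormality penalties."""
    return task_loss + lam * sum(orthonormality_penalty(W) for W in weights)
```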
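A sketch of building an orthogonal matrix as a product of Householder reflections, in the spirit of the OrthCaps parameterization; the number of reflections and the vector initialization are assumptions for illustration:

```python
import numpy as np

def householder(v: np.ndarray) -> np.ndarray:
    """Householder reflection H = I - 2 v v^T / ||v||^2 (orthogonal by construction)."""
    v = v.reshape(-1, 1)
    return np.eye(v.shape[0]) - 2.0 * (v @ v.T) / float(v.T @ v)

def householder_product(vectors: list[np.ndarray]) -> np.ndarray:
    """Product of Householder reflections; a product of orthogonal matrices stays orthogonal."""
    W = np.eye(vectors[0].shape[0])
    for v in vectors:
        W = W @ householder(v)
    return W

rng = np.random.default_rng(0)
vs = [rng.standard_normal(8) for _ in range(4)]        # 4 reflections in R^8
W = householder_product(vs)
assert np.allclose(W.T @ W, np.eye(8), atol=1e-10)     # W^T W = I up to float error
```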
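A sketch of the LDL-based orthogonalization of activations used for subspace node pruning. Here the LDL factors are obtained via a Cholesky factorization of the Gram matrix (an implementation shortcut, not necessarily how the authors compute it), and the small ridge term is an assumption for numerical stability:

```python
import numpy as np

def orthogonalize_activations(X: np.ndarray, eps: float = 1e-6):
    """X: (units, samples). Returns L and the projected activity L^{-1} X,
    whose rows are mutually orthogonal (the Gram matrix becomes diagonal)."""
    gram = X @ X.T + eps * np.eye(X.shape[0])    # ridge term for numerical stability
    C = np.linalg.cholesky(gram)                 # gram = C C^T, C lower-triangular
    d = np.diag(C).copy()
    L = C / d                                    # unit-diagonal L; gram = L D L^T with D = diag(d**2)
    X_orth = np.linalg.solve(L, X)               # projected activity L^{-1} X
    return L, X_orth

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 512))               # 16 units, 512 samples
L, X_orth = orthogonalize_activations(X)
G = X_orth @ X_orth.T
off_diag = G - np.diag(np.diag(G))
assert np.max(np.abs(off_diag)) < 1e-3           # rows are (numerically) orthogonal
```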
3. Pruning Subnetworks in Orthogonal Spaces
Pruning is made tractable and effective by leveraging orthogonality to decouple weights or activations; representative mechanisms include the following (code sketches for several of these steps follow the list):
- Group or filter pruning in orthonormal bases: Under OrthoReg, the first-order Taylor importance of a filter with parameters $\theta_f$ is $\mathcal{I}_f = \lvert \theta_f^\top \nabla_{\theta_f} \mathcal{L} \rvert$. With nearly orthogonal filters, group-wise importance sums are additive, $\mathcal{I}_G \approx \sum_{f \in G} \mathcal{I}_f$, enabling simultaneous large-group pruning with minimal bias (Lubana et al., 2020).
- Capsule pruning via cosine redundancy: OrthCaps prunes primary capsules using a cosine-similarity threshold: when two capsules' similarity exceeds the threshold, the less important one (as ranked by activation norm) is pruned. This reduces redundancy prior to each routing step (Geng et al., 2024).
- Orthogonal subspace projection for node ranking: In SNP, layer-wise activity is projected onto orthogonal rows; ordering is optimized (e.g., via ZCA-based heuristics) to maximize cumulative variance in the kept directions. A target variance or parameter budget determines how many top-variance orthogonal units are retained, with the impact of pruned units reconstructed via linear least squares (Offergeld et al., 2024).
- Directional pruning via orthogonal projection: Structured Directional Pruning (SDP) solves
$$\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) + \lambda \sum_{g} c_g \lVert \mathbf{w}_g \rVert_2,$$
where the $c_g$ are direction factors derived from projecting parameter groups onto the tangent space of the flat minimum valley. The AltSDP solver interleaves SGD-like updates with group shrinkage adjusted along these orthogonal directions (Li et al., 2021).
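A sketch of similarity-based capsule pruning in the spirit of OrthCaps; the threshold value and the greedy keep-first ordering are assumptions for illustration:

```python
import numpy as np

def prune_redundant_capsules(capsules: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """capsules: (num_capsules, dim), assumed sorted by importance (e.g., activation norm),
    most important first. Returns a boolean keep-mask: a capsule is dropped when its
    cosine similarity with an already-kept capsule exceeds the threshold."""
    norms = np.linalg.norm(capsules, axis=1, keepdims=True) + 1e-12
    unit = capsules / norms
    keep = np.ones(len(capsules), dtype=bool)
    for i in range(len(capsules)):
        if not keep[i]:
            continue
        sims = unit[i + 1:] @ unit[i]              # cosine similarity with the remaining capsules
        keep[i + 1:] &= ~(sims > threshold)        # drop the less important (later) duplicates
    return keep
```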
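A sketch of the variance-based selection and least-squares reconstruction described for SNP; the variance-budget interface and the pseudo-inverse reconstruction of the next layer's weights are illustrative assumptions:

```python
import numpy as np

def select_by_variance(X_orth: np.ndarray, target_variance: float = 0.95) -> np.ndarray:
    """Keep the top-variance orthogonal units whose cumulative variance reaches the target."""
    var = np.sum(X_orth ** 2, axis=1)              # per-unit energy in the orthogonal basis
    order = np.argsort(var)[::-1]
    cum = np.cumsum(var[order]) / np.sum(var)
    k = int(np.searchsorted(cum, target_variance)) + 1
    keep = np.zeros(X_orth.shape[0], dtype=bool)
    keep[order[:k]] = True
    return keep

def reconstruct_pruned(W_next: np.ndarray, X: np.ndarray, keep: np.ndarray) -> np.ndarray:
    """Absorb the pruned units' contribution into the next layer's weights by linear
    least squares: find W_kept such that W_kept @ X[keep] approximates W_next @ X."""
    target = W_next @ X                            # original pre-activations of the next layer
    sol, *_ = np.linalg.lstsq(X[keep].T, target.T, rcond=None)
    return sol.T                                   # new weights acting on the kept units only
```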
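A sketch of one alternating step in the AltSDP spirit: an SGD-like update followed by direction-weighted group soft-thresholding. The per-group direction factors and step sizes here are placeholders; the actual solver derives them from the loss-landscape geometry:

```python
import numpy as np

def group_soft_threshold(w_g: np.ndarray, thresh: float) -> np.ndarray:
    """Proximal operator of the group-l2 penalty: shrink the whole group toward zero."""
    norm = np.linalg.norm(w_g)
    if norm <= thresh:
        return np.zeros_like(w_g)
    return (1.0 - thresh / norm) * w_g

def altsdp_like_step(groups, grads, c, lr=0.1, lam=1e-3):
    """One alternating step: SGD-like update on the task loss, then group shrinkage
    weighted by per-group direction factors c[g] (larger c => stronger shrinkage)."""
    new_groups = []
    for w_g, g_g, c_g in zip(groups, grads, c):
        w_g = w_g - lr * g_g                               # gradient step
        w_g = group_soft_threshold(w_g, lr * lam * c_g)    # direction-weighted group shrinkage
        new_groups.append(w_g)
    return new_groups
```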
4. Algorithms and Workflows
Representative pipelines are summarized below:
| Approach | Orthogonalization Mechanism | Pruning Step |
|---|---|---|
| OrthoReg | Loss-augmented orthonormal penalty | Taylor importance, group-additive, iterative |
| OrthCaps | Exact Householder product | Cosine similarity threshold, Entmax routing |
| SNP | LDL (unnormalized GS) on activations | Variance-based orthogonal projection pruning |
| SDP/AltSDP | Hessian/projector via dynamics | Tangent-space group shrinkage |
- OrthoReg (Lubana et al., 2020): Pretrain → fine-tune with the orthonormality-regularized loss $\mathcal{L}_{\text{task}} + \lambda \sum_l R(W_l)$ → prune based on Taylor scores → fine-tune remaining rounds → retrain final network. Large early pruning (30–50%) is enabled by orthogonality, reducing rounds from ∼5 to ∼2.
- OrthCaps (Geng et al., 2024): Prune primary capsules based on norm and cosine similarity → replace dynamic routing with orthogonal, one-shot attention → enforce orthogonality on projection matrices via Householder decomposition → achieve strong accuracy with <2% of standard parameters.
- SNP (Offergeld et al., 2024): Project activations per-layer via optimized, order-sensitive LDL/GS → prune lowest-variance orthogonal units → reconstruct effects with linear regression; automatic proportioning via cumulative explained variance.
- SDP/AltSDP (Li et al., 2021): Alternate SGD with group shrinkage in directions tangent to flat minima; the design ensures weights are pruned along nearly loss-invariant axes, and the result is proven to lie in the same valley as the baseline optimizer's solution.
5. Empirical Results and Comparative Performance
Orthogonality-based pruning yields consistent improvements in both efficiency and accuracy across models and tasks:
- OrthoReg results (Lubana et al., 2020): For VGG-13, MobileNet-V1, ResNet-34 on CIFAR-100 and Tiny-ImageNet:
- 65% pruning (ResNet-34, CIFAR-100): 74.1% (OrthoReg) vs. 73.2% (Fisher)
- 60% pruning (ResNet-34, Tiny-ImageNet): 54.7% vs. 52.7%
- Early-bird ticket setting: OrthoReg finds subnetworks matching or exceeding baseline accuracy with single-shot pruning after a few epochs.
- OrthCaps (Geng et al., 2024): OrthCaps-Shallow (∼1.25% standard parameters) achieves 99.68% (MNIST), 86.84% (CIFAR10); OrthCaps-Deep (<1.5% ResNet-18 parameters) with 90.56% (CIFAR10). Ablations confirm both pruning and orthogonal routing are critical for best results.
- SNP (Offergeld et al., 2024): On ImageNet VGG-16, up to 50–60% parameter reduction with under 1% Top-1 drop, outperforming SAW and unstructured pruning; ResNet-50 with 30–40% FLOP reduction for <1% Top-1 loss.
- SDP/AltSDP (Li et al., 2021): Up to 55% FLOPs reduction on CIFAR-10/ResNet-56 with ≤0.1% accuracy drop; no retraining post-pruning required.
6. Architectural Generality and Variants
Pruned and orthogonal subnetworks span multiple architectures:
- CNNs: Most orthogonality-based methods have been demonstrated on convolutional layers; OrthoReg and SNP impose orthogonality on filters or activations, achieving layer-wise or network-wide pruning.
- Capsule Networks: OrthCaps demonstrates that redundancy and computational cost in dynamic routing can be controlled only when orthogonality is enforced end-to-end, not just after the initial layer.
- Deep MLPs/Feedforward: Directional pruning and subspace orthogonal projections are architecture-agnostic and can be applied at arbitrary layers with group-wise or node-wise sparsity.
7. Interpretability, Limitations, and Open Directions
Orthonormality and orthogonal projection clarify group importance estimates and make pruning heuristics more transparent to analysis. Empirical evidence supports the contention that orthogonal subnetworks preserve task-relevant signals and maintain “dynamical isometry” for gradient propagation. A notable limitation is computational overhead: explicit orthogonalization requires matrix factorizations or the maintenance of transformation matrices, though practical heuristics (batch-wise LDL, Householder products) mitigate this cost.
A plausible implication is that future directions will further integrate data-driven subspace discovery, orthogonality-promoting architectures, and loss landscape geometry in large-scale networks for robust sparsification and transferability. Optimal ordering in orthogonalization and more nuanced quantification of “variance explained” per subspace remain active areas.
For detailed algorithms, implementation guidelines, activation order heuristics, and empirical layer-wise ablations, see (Lubana et al., 2020) (OrthoReg), (Geng et al., 2024) (OrthCaps), (Li et al., 2021) (SDP/AltSDP), and (Offergeld et al., 2024) (SNP).