Filter Skeleton (FS) in Neural Networks
- Filter Skeleton (FS) is a learnable scaling matrix that modulates individual sub-filter weights to enable dynamic pruning and refined feature aggregation in neural architectures.
- It enables stripe-wise pruning in CNNs and Gaussian-based topology adjustment in GCNs, offering fine-grained control over parameter sparsity and structural refinement.
- Empirical results show that FS methods achieve significant compression with minimal accuracy loss, facilitating hardware-efficient deployment and robust performance.
A filter skeleton (FS) is a learnable, structured scaling matrix integrated into neural architectures to modulate feature aggregation and enable dynamic pruning or selection of local structures within a high-dimensional filter or graph. The FS concept addresses the challenge of achieving fine-grained control over parameter sparsity or topological refinement while maintaining inference efficiency and strong empirical performance. It is central to both Stripe-Wise Pruning (SWP) for convolutional neural networks (CNNs), where FSs enable sub-filter granularity compression, and to Gaussian-based skeletal topology refinement in advanced gated graph convolutional networks (GCNs), where FS mechanisms regularize feature aggregation on skeleton graphs.
1. Mathematical Definitions and Core Mechanisms
In the context of convolutional neural networks, consider a convolutional layer with weight tensor $W \in \mathbb{R}^{N \times C \times K \times K}$, where $N$ is the number of output filters, $C$ is the input channel dimension, and $K$ the spatial kernel width. Each filter $W_n$ of shape $C \times K \times K$ is decomposed into $K^2$ stripes—$1 \times 1$ spatial sub-filters indexed by position $(i, j)$ with associated vector $W_{n,:,i,j} \in \mathbb{R}^C$. The filter skeleton $I \in \mathbb{R}^{N \times K \times K}$ assigns a single scalar $I_{n,i,j}$ to each stripe within each filter.
The effective weights during forward propagation are given by

$$W'_{n,c,i,j} = I_{n,i,j} \cdot W_{n,c,i,j}.$$

This multiplicative modulation enables dynamic attenuation or enhancement of individual stripes inside conventional filters.
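A minimal plain-Python sketch of this modulation, assuming the shapes above ($W$ as $N \times C \times K \times K$, $I$ as $N \times K \times K$; the function name is illustrative):

```python
# Sketch of FS modulation: W'[n][c][i][j] = I[n][i][j] * W[n][c][i][j].
# Names and list-based tensors are illustrative, not from a specific library.

def apply_filter_skeleton(W, I):
    """Return effective weights by scaling each stripe of W with its FS entry."""
    N = len(W); C = len(W[0]); K = len(W[0][0])
    return [[[[I[n][i][j] * W[n][c][i][j] for j in range(K)]
              for i in range(K)]
             for c in range(C)]
            for n in range(N)]

# Toy example: one filter (N=1), one channel (C=1), K=2.
W = [[[[1.0, 2.0], [3.0, 4.0]]]]
I = [[[0.5, 0.0], [1.0, 2.0]]]   # stripe at position (0, 1) fully attenuated
W_eff = apply_filter_skeleton(W, I)
print(W_eff[0][0])  # [[0.5, 0.0], [3.0, 8.0]]
```

Setting an FS entry to zero silences the corresponding stripe entirely, which is exactly the behavior exploited for pruning.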
On graph-structured data, as in G³CN for skeleton-based action recognition, a skeleton's topological adjacency is refined through a "Gaussian filter skeleton." The joint distance matrix $D$ encodes shortest-path distances between graph nodes (joints); a Gaussian kernel $G_{ij} = \exp\!\big(-D_{ij}^2 / 2\sigma^2\big)$, with bandwidth $\sigma$, provides a soft spatial weighting function. Integration of these filter skeletons yields a topology-refined adjacency matrix $\tilde{A}$ used for downstream gated aggregation, with further learned normalization and projection steps to complete the skeleton-based correction of adjacency.
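A hedged sketch of the Gaussian weighting, assuming a hop-distance matrix and an elementwise combination with the base adjacency (the combination rule and all names here are illustrative assumptions, not the paper's exact formulation):

```python
import math

def gaussian_skeleton(D, sigma=1.0):
    """Soft spatial weights: G[i][j] = exp(-D[i][j]^2 / (2 * sigma^2))."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row] for row in D]

def refine_adjacency(A, G):
    """Illustrative elementwise modulation of the base adjacency by G."""
    n = len(A)
    return [[A[i][j] * G[i][j] for j in range(n)] for i in range(n)]

# 3-joint chain: hop distances 0, 1, 2; A includes self-loops.
D = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
G = gaussian_skeleton(D, sigma=1.0)
A_ref = refine_adjacency(A, G)
```

Distant joint pairs receive exponentially smaller weights, so the refined adjacency concentrates aggregation on spatially close joints while leaving zero entries of the base graph untouched.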
2. Filter Skeleton Learning and Sparsification
Learning of the FS involves optimization over both the standard weights $W$ and the skeleton entries $I$. In SWP, an $\ell_1$ penalty is imposed on each $I_{n,i,j}$, encouraging sparsity:

$$\mathcal{L} = \ell\big(f(x;\, W \odot I),\, y\big) + \alpha \sum_{n,i,j} |I_{n,i,j}|.$$

The FS parameters act as a differentiable mask, with gradients computed as

$$\frac{\partial \mathcal{L}}{\partial I_{n,i,j}} = \sum_{c} \frac{\partial \mathcal{L}}{\partial W'_{n,c,i,j}}\, W_{n,c,i,j} + \alpha\, \operatorname{sign}(I_{n,i,j}).$$
After optimization, all stripes with $|I_{n,i,j}|$ below a small threshold $\delta$ are pruned (frozen to zero), and the FS can be merged into the weights ($W \leftarrow I \odot W$) for deployment efficiency.
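The subgradient and prune-then-merge steps can be sketched for a single filter as follows (the function names, penalty weight `alpha`, and threshold `delta` are illustrative):

```python
# Sketch of FS sparsification for one filter: W has shape C x K x K, I has K x K.

def fs_subgradient(grad_W_eff, W, I, alpha):
    """dL/dI[i][j] = sum_c dL/dW'[c][i][j] * W[c][i][j] + alpha * sign(I[i][j])."""
    C, K = len(W), len(W[0][0])
    sign = lambda x: (x > 0) - (x < 0)
    return [[sum(grad_W_eff[c][i][j] * W[c][i][j] for c in range(C))
             + alpha * sign(I[i][j])
             for j in range(K)] for i in range(K)]

def prune_and_merge(W, I, delta):
    """Zero stripes with |I| < delta, then fold I into W (W <- I * W)."""
    C, K = len(W), len(W[0][0])
    kept = [[I[i][j] if abs(I[i][j]) >= delta else 0.0 for j in range(K)]
            for i in range(K)]
    merged = [[[kept[i][j] * W[c][i][j] for j in range(K)] for i in range(K)]
              for c in range(C)]
    return merged, kept

I = [[0.9, 0.01], [0.4, -0.005]]
W = [[[1.0, 1.0], [1.0, 1.0]]]          # C=1, K=2
W_merged, I_kept = prune_and_merge(W, I, delta=0.05)
print(I_kept)  # [[0.9, 0.0], [0.4, 0.0]]
```

After merging, the network runs as an ordinary convolution over the surviving stripes; no separate FS tensor is needed at inference time.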
In Gaussian skeleton graph models, the FS learning is implicit in the propagation of gradients through analytically defined Gaussian kernels and through the end-to-end training of adjacency correction parameters.
3. Comparison with Classical Pruning and Aggregation Methods
Traditional filter pruning in CNNs removes entire filters (a single prunable unit per filter), sacrificing both flexibility and representational capacity. The FS mechanism in SWP increases pruning granularity from $1$ to $K^2$ units per filter, allowing intermediate retention of salient spatial substructures. This $K^2$-fold increase in pruning resolution outperforms both group-wise and channel-wise approaches, neither of which adaptively learns individualized sub-filter shapes.
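The granularity arithmetic for a representative layer (the values below are a worked illustration, not figures from the cited experiments):

```python
# Whole-filter pruning offers one decision per filter; stripe-wise pruning
# offers K^2 decisions per filter, so N * K^2 per layer.
N, K = 64, 3
filter_units = N            # one prunable unit per filter
stripe_units = N * K * K    # one prunable unit per stripe
print(filter_units, stripe_units)  # 64 576
```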
Similarly, in the G³CN framework, classical GCNs aggregate information using a fixed skeletal graph or uniformly parameterized adjacency matrices. Gaussian filter skeletons refine these topologies with learnable, sample-adaptive spatial weights, effectively "sculpting" the receptive field for each action class, and yielding superior differentiation of ambiguous human actions.
4. Algorithmic Workflows
A canonical procedure using FS in SWP for CNN pruning involves:
- Initialization of the filter skeleton $I$ for all layers.
- Joint training of $W$ and $I$ with sparsity regularization.
- Pruning of stripes where $|I_{n,i,j}| < \delta$.
- Merging of $I$ into $W$ (i.e., $W \leftarrow I \odot W$) and discarding $I$.
- Optional light fine-tuning for recovery of potential accuracy loss.
- Inference is conducted via standard convolution, omitting pruned stripes.
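The stripe decomposition also clarifies why inference can simply skip pruned positions: a $K \times K$ convolution is a sum of per-stripe contributions, as in this single-channel sketch (function name, dictionary layout, and "valid" cross-correlation convention are illustrative assumptions):

```python
# Stripe-wise inference sketch: only retained stripes contribute to the output.

def stripe_conv2d(x, stripes, K):
    """x: H x W input; stripes: dict mapping (i, j) -> scalar stripe weight,
    containing only the stripes that survived pruning."""
    H, W_ = len(x), len(x[0])
    out = [[0.0] * (W_ - K + 1) for _ in range(H - K + 1)]
    for (i, j), w in stripes.items():       # pruned stripes are simply absent
        for r in range(H - K + 1):
            for c in range(W_ - K + 1):
                out[r][c] += w * x[r + i][c + j]
    return out

# 3x3 input, K=2 kernel with only two of four stripes retained.
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
out = stripe_conv2d(x, {(0, 0): 1.0, (1, 1): 1.0}, K=2)
print(out)  # [[6.0, 8.0], [12.0, 14.0]]
```

Because the output is a sum over retained stripes only, the bookkeeping reduces to a per-layer index list of surviving $(i, j)$ positions, matching the "minor record-keeping" noted in Section 7.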
For G³CN, the block structure alternates temporal convolution with spatial skeleton filtering, where the adjacency is dynamically refined using Gaussian-parameterized skeletons followed by gated message passing.
5. Empirical Performance and Hardware Efficiency
Filter skeleton-based SWP achieves state-of-the-art compression ratios without specialized hardware or custom kernels. For instance, on CIFAR-10 with a VGG16 backbone, SWP prunes 92.66% of parameters and 71.16% of FLOPs with only a 0.40% drop in accuracy (93.25% → 92.85%). On ResNet18/ImageNet, 54.6% of FLOPs are removed with negligible (<0.2%) or even negative (improved) top-1 accuracy change (Meng et al., 2020).
In skeleton-based GCNs, filter skeleton mechanisms result in substantial gains for ambiguous action classes (e.g., "Playing with phone/tablet": +8.2%; "Wear a shoe": +10.0% top-1 accuracy on NTU-60 cross-subject) and consistent overall improvements when plugged into diverse GCN backbones (Ren et al., 9 Sep 2025).
6. Structural Impact and Contextual Significance
By decoupling parameter importance at the sub-filter or sub-graph level, filter skeletons facilitate both adaptive model compression and interpretable topology refinement. FS modules serve as highly parameter-efficient surrogates for explicit mask learning or fixed manual pruning criteria. Filters trained under the FS regime exhibit sparser and empirically more stable weight distributions; notably, even when base weights are randomly fixed, learning only the FS can yield high classification accuracy, underscoring the representational power inherent to the skeleton itself (Meng et al., 2020).
In graph domains, incorporating a Gaussian filter skeleton sharpens weak spatial correlations, leading to sparse, selective feature aggregation and improved robustness for challenging classification scenarios.
7. Practical Deployment and Future Directions
FS-based methods, by operating at a hardware-compatible granularity and requiring only minor record-keeping (e.g., indices for retained stripes per layer), enable straightforward integration into existing inference pipelines without the need for custom acceleration libraries. A plausible implication is that further extensions of FS concepts—toward multi-dimensional or content-adaptive masking—could yield new architectures balancing interpretability, sparsity, and universality for both vision and structured data modalities. Moreover, the empirical observation that FS structure alone can drive meaningful predictions suggests a potential research avenue in explicit structural learning and modular network design.