ThriftyNets: Deep Learning with Shared Parameters
- ThriftyNets are deep learning architectures defined by maximal parameter sharing, where a single convolutional weight tensor is recursively applied to reduce model size.
- They incorporate shortcut histories and adaptive downsampling to balance expressivity with computational efficiency and parameter budget.
- Empirical results on benchmarks like CIFAR-10/100 show that ThriftyNets achieve competitive accuracy with significantly fewer parameters than traditional CNNs.
ThriftyNets denote a modern deep learning architecture class distinguished by maximal parameter sharing, achieving high classification accuracy with sharply reduced parameter counts. In the canonical ThriftyNet model, only a single convolutional layer weight tensor is defined, which is applied recursively across multiple processing steps, with additional mechanisms—nonlinearity, normalization, downsampling, and shortcut connections—ensuring sufficient expressivity. ThriftyNets depart from conventional practice, where network depth and width typically increase parameter count, instead leveraging recurrence and parameter factorization to attain competitive performance under severe parameter budget constraints (Coiffier et al., 2020). The term "ThriftyNet" also appears independently in the context of wireless system design, where it concerns resource-efficient networking by aligning physical throughput with application-level goodput, often through advanced coding and transport-layer design (Kim et al., 2012).
1. ThriftyNet Architecture and Parameter Factorization
The ThriftyNet architecture is defined by the recursive application of a single convolutional-layer weight tensor $W$ across $T$ iterations. The input image, zero-padded to $f$ channels, is initialized as
$$x_0 = \mathrm{pad}_f(x),$$
and evolved via
$$x_{t+1} = D_t\big(\mathrm{BN}_t\big(\sigma(x_t * W)\big)\big), \qquad t = 0, \dots, T-1,$$
with a ReLU/tanh activation $\sigma$, step-dependent BatchNorm $\mathrm{BN}_t$ (with $2f$ affine parameters per step), and optional downsampling $D_t$. Parameter factorization yields a convolutional-layer budget of only $9f^2$ weights (ungrouped $3 \times 3$ kernels), maintained constant across all processing steps.
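The recurrence above can be sketched in a few lines of PyTorch. Everything in the snippet is illustrative rather than the reference implementation: the filter count, iteration count, ReLU choice, max-pooling schedule, and the global-average-pooled linear classifier are assumptions made only to give a runnable example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThriftyBlockSketch(nn.Module):
    """Illustrative ThriftyNet-style recurrence: one shared conv weight
    applied T times, with a separate BatchNorm per step and optional pooling."""
    def __init__(self, f=128, T=30, n_classes=10, pool_steps=(7, 14, 21, 28)):
        super().__init__()
        # Single shared 3x3 convolution: the only conv parameters (9 * f * f weights).
        self.conv = nn.Conv2d(f, f, kernel_size=3, padding=1, bias=False)
        # Step-dependent BatchNorm: 2 * f affine parameters per step.
        self.bns = nn.ModuleList([nn.BatchNorm2d(f) for _ in range(T)])
        self.pool_steps = set(pool_steps)   # assumed downsampling schedule D_t
        self.T = T
        self.f = f
        self.classifier = nn.Linear(f, n_classes)

    def forward(self, x):
        # Zero-pad the 3-channel input up to f channels.
        x = F.pad(x, (0, 0, 0, 0, 0, self.f - x.size(1)))
        for t in range(self.T):
            # x_{t+1} = BN_t(sigma(x_t * W)); ReLU used here, tanh is an alternative.
            x = self.bns[t](torch.relu(self.conv(x)))
            if t in self.pool_steps:        # optional downsampling D_t
                x = F.max_pool2d(x, 2)
        x = x.mean(dim=(2, 3))              # global average pooling
        return self.classifier(x)
```

With these defaults, virtually all convolutional parameters live in the single shared $9f^2$ weight tensor, which is what gives the architecture its small footprint.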
2. Residual Connections and Shortcut History
Expressivity and trainability are further enhanced via shortcut histories. For history length $h$, learnable scalars $\alpha_{t,i}$ combine the new activation with up to $h$ past activations at each step:
$$x_{t+1} = D_t\Big(\mathrm{BN}_t\Big(\alpha_{t,0}\,\sigma(x_t * W) + \sum_{i=1}^{h} \alpha_{t,i}\, x_{t+1-i}\Big)\Big).$$
This incurs only $T(h+1)$ additional scalar parameters. Ablations show that learning real-valued $\alpha_{t,i}$ is essential: hard-thresholding them to binarized values degrades accuracy by 1–2 points.
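A sketch of how such a history could be maintained in the loop is shown below, assuming the weighted-sum form written above; handling of downsampling (which changes spatial size and therefore which past activations can be mixed) is omitted for brevity.

```python
import torch
import torch.nn as nn

class ResidualHistorySketch(nn.Module):
    """Keeps the last h activations and mixes them with learnable scalars alpha_{t,i}."""
    def __init__(self, f=128, T=30, h=3):
        super().__init__()
        self.conv = nn.Conv2d(f, f, kernel_size=3, padding=1, bias=False)
        self.bns = nn.ModuleList([nn.BatchNorm2d(f) for _ in range(T)])
        # (h + 1) learnable scalars per step: one for the new activation, h for the history.
        self.alpha = nn.Parameter(torch.ones(T, h + 1))
        self.T, self.h = T, h

    def forward(self, x):
        history = [x]                        # past activations, newest first
        for t in range(self.T):
            new = torch.relu(self.conv(history[0]))
            mix = self.alpha[t, 0] * new
            # Add up to h past activations, weighted by their learned scalars.
            for i, past in enumerate(history[: self.h], start=1):
                mix = mix + self.alpha[t, i] * past
            x = self.bns[t](mix)
            history.insert(0, x)
            history = history[: self.h]      # keep at most h past activations
        return x
```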
3. Evolution of Feature Geometry and Parameter Budget
Throughout, the channel count is kept constant at $f$; spatial resolution evolves as
$$s_{t+1} = s_t / r_t,$$
where the downsampling ratios $r_t \geq 1$ are scheduled as needed. Downsampling can be applied regularly or irregularly, allowing explicit control over the MACs (multiply-accumulate operations) and feature-map size. The full parameter count (excluding grouped convolutions), for $C$ output classes, is approximately
$$9f^2 + 2fT + (h+1)T + fC.$$
Grouped/depthwise convolutions further reduce the dominant $9f^2$ term to $9f^2/g$ for $g$ groups.
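As a quick sanity check of the budget, the helper below evaluates the approximate count given above; the formula is the reconstruction stated in this section, with a final linear classifier over $C$ classes assumed and biases ignored.

```python
def thrifty_param_count(f, T, h=0, n_classes=10, groups=1):
    """Approximate ThriftyNet parameter count under the factorization above."""
    conv = 9 * f * f // groups          # single shared 3x3 convolution
    bn = 2 * f * T                      # per-step BatchNorm affine parameters
    shortcuts = (h + 1) * T if h else 0 # shortcut-history scalars
    classifier = f * n_classes          # final linear layer (biases ignored)
    return conv + bn + shortcuts + classifier

# e.g. f=128, T=30, h=3 -> 147456 + 7680 + 120 + 1280 = 156536
print(thrifty_param_count(128, 30, h=3))
```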
4. Training and Inference Procedures
ThriftyNet models are trained with cross-entropy loss, SGD (learning rate $0.1$, momentum $0.9$), no explicit weight decay, and a step-down schedule at epochs $50$, $100$, and $150$ for a total of $200$ epochs. Data augmentation via random cropping and horizontal flipping is found to deliver most of the attainable gains; more aggressive augmentation (AutoAugment, mixup, cutmix, cutout) yields marginal or negative returns, as shown in the table below:
| Augmentation | Test Accuracy (%) |
|---|---|
| standard (flip + crop) | 90.64 |
| + auto augment | 91.00 |
| + cutout (size=8) | 90.40 |
| + mixup | 88.09 |
| + cutmix | 88.47 |
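Putting the reported hyperparameters together, a minimal PyTorch training configuration might look as follows. The batch size, normalization statistics, and the $0.1$ learning-rate decay factor are assumptions (the text fixes only the optimizer settings, the step-down epochs, and the flip-plus-crop augmentation), and `ThriftyBlockSketch` refers to the sketch module above.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Standard augmentation: random crop with padding + horizontal flip (CIFAR-10 stats).
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

model = ThriftyBlockSketch(f=128, T=30)      # sketch module from above
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # no weight decay
# Decay factor 0.1 is an assumption; the text only specifies the step-down epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 100, 150], gamma=0.1)

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```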
Inference involves sequentially running all $T$ recursions; resource/accuracy trade-offs can be managed by adjusting $T$ or the pooling schedule.
5. Empirical Performance and Ablation Analyses
Results on CIFAR-10/100 establish that ThriftyNet achieves parity with much larger networks at a fraction of their parameter budget. For CIFAR-10, at a small fixed parameter budget:
- ThriftyNet: accuracy within roughly one point of the ResNet-20 baseline (see the ablation tables below)
- Residual ThriftyNet: slightly higher accuracy at the same budget, matching or exceeding the baseline in its best configurations
- ResNet-20 baseline: $91.25$\%, at roughly $0.27$M parameters
For CIFAR-100:
- Residual ThriftyNet: $74.37$\%
- DenseNet-BC ($L=100$, $k=12$): $77.73$\% (roughly $0.8$M parameters)
- Wide-ResNet-16-8: $69.11$\%
Ablations show moderate sensitivity to the number of iterations $T$ over the range reported below (for a fixed budget), clear gains from additional downsampling steps, and a strong dependence of accuracy on the number of filters $f$. For example, increasing $f$ from $32$ ($3$k params) to $256$ ($80$k params) on CIFAR-10 substantially boosts accuracy.
| Iterations ($T$) | ThriftyNet Acc. (%) | Residual ThriftyNet Acc. (%) |
|---|---|---|
| 10 | 89.2 | 89.3 |
| 20 | 90.47 | 90.71 |
| 45 | 90.18 | 90.95 |

| Downsamplings | ThriftyNet Acc. (%) | Residual ThriftyNet Acc. (%) |
|---|---|---|
| 1 | 89.10 | 90.62 |
| 4 | 90.93 | 91.69 |
6. Computational and Design Trade-offs
ThriftyNets allow fine-grained adjustment between parameter count, number of iterations $T$, and computational cost (MACs) to target specific hardware profiles. Pooling can be front-loaded to halve computational cost at a tolerable accuracy loss (4–5 points). Grouped convolutions are available to further shrink the parameter budget. The principal trade-off is between parameter reuse (via $T$), accuracy, and computational efficiency.
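The effect of grouping on the dominant $9f^2$ term can be checked directly in PyTorch; the choice of $g=8$ groups below is purely illustrative and not a configuration from the paper.

```python
import torch.nn as nn

f = 128
dense = nn.Conv2d(f, f, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(f, f, kernel_size=3, padding=1, bias=False, groups=8)
print(dense.weight.numel())    # 9 * f**2     = 147456 parameters
print(grouped.weight.numel())  # 9 * f**2 / 8 =  18432 parameters (g = 8)
```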
7. Broader Context and Related Work
ThriftyNet's maximal parameter factorization contrasts with deepening/widening used in standard CNNs and may be viewed as an instance of recurrent convolutional networks with weight sharing. The architecture is distinct from parameter-pruning or quantization approaches, as it operates directly through strict model weight reuse. The term "ThriftyNet" is also encountered in wireless networking literature—there, a "ThriftyNet" is defined as one whose infrastructure is dimensioned by the user-perceived goodput, not peak physical throughput, and exploits coding techniques such as TCP/NC to maintain user-level performance while minimizing physical plant and energy costs. The key principle in both contexts is resource efficiency: in neural networks, via maximal parameter reuse; in wireless networks, via protocols that close the gap between throughput and goodput, minimizing required infrastructural spend (Coiffier et al., 2020, Kim et al., 2012).