ThriftyNets: Deep Learning with Shared Parameters

Updated 9 December 2025
  • ThriftyNets are deep learning architectures defined by maximal parameter sharing, where a single convolutional weight tensor is recursively applied to reduce model size.
  • They incorporate shortcut histories and adaptive downsampling to balance expressivity with computational efficiency and parameter budget.
  • Empirical results on benchmarks like CIFAR-10/100 show that ThriftyNets achieve competitive accuracy with significantly fewer parameters than traditional CNNs.

ThriftyNets denote a modern deep learning architecture class distinguished by maximal parameter sharing, achieving high classification accuracy with sharply reduced parameter counts. In the canonical ThriftyNet model, only a single convolutional layer weight tensor is defined, which is applied recursively across multiple processing steps, with additional mechanisms—nonlinearity, normalization, downsampling, and shortcut connections—ensuring sufficient expressivity. ThriftyNets depart from conventional practice, where network depth and width typically increase parameter count, instead leveraging recurrence and parameter factorization to attain competitive performance under severe parameter budget constraints (Coiffier et al., 2020). The term "ThriftyNet" also appears independently in the context of wireless system design, where it concerns resource-efficient networking by aligning physical throughput with application-level goodput, often through advanced coding and transport-layer design (Kim et al., 2012).

1. ThriftyNet Architecture and Parameter Factorization

The ThriftyNet architecture is defined by the recursive application of a single convolutional-layer weight tensor, $\mathbf{W} \in \mathbb{R}^{f \times f \times a \times b}$, across $T$ iterations. The input image, padded to match $f$ channels, is processed as

$$\mathbf{x}_0 = \mathrm{PAD}(\mathbf{x}) \in \mathbb{R}^{f \times H \times W}$$

and evolved via

$$\begin{aligned}
\mathbf{a}_{t+1} &= \mathbf{W} \star \mathbf{x}_t, \\
\mathbf{s}_{t+1} &= \mathbf{x}_t + \sigma(\mathbf{a}_{t+1}), \\
\mathbf{y}_{t+1} &= \mathrm{BN}_t(\mathbf{s}_{t+1}), \\
\mathbf{x}_{t+1} &= \mathcal{D}_t(\mathbf{y}_{t+1}),
\end{aligned}$$

with ReLU/tanh activations, step-dependent BatchNorm (with $2f$ affine parameters per step), and optional downsampling $\mathcal{D}_t$. Parameter factorization yields a convolutional-layer budget of only $f^2 a b$ (ungrouped), maintained constant across all processing steps.
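The recursion above is compact enough to sketch directly in PyTorch. The sketch below is illustrative rather than the authors' reference implementation: the filter count, the number of iterations, the downsampling positions, and the use of max pooling for $\mathcal{D}_t$ are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThriftyNetSketch(nn.Module):
    """Minimal ThriftyNet core: one shared convolution applied for T recursions."""

    def __init__(self, in_channels=3, f=128, T=15, kernel_size=3,
                 n_classes=10, downsample_at=(5, 10)):
        super().__init__()
        self.f, self.T, self.in_channels = f, T, in_channels
        # The single shared weight tensor W: the only convolutional parameters (f^2 * a * b).
        self.conv = nn.Conv2d(f, f, kernel_size, padding=kernel_size // 2, bias=False)
        # Step-dependent BatchNorm: 2f affine parameters per recursion.
        self.bns = nn.ModuleList([nn.BatchNorm2d(f) for _ in range(T)])
        self.downsample_at = set(downsample_at)  # assumed schedule for D_t
        # Final classifier, matching the f x (#classes) term of the parameter count.
        self.classifier = nn.Linear(f, n_classes, bias=False)

    def forward(self, x):
        # Zero-pad the input along the channel axis so that x_0 has f channels.
        x = F.pad(x, (0, 0, 0, 0, 0, self.f - self.in_channels))
        for t in range(self.T):
            a = self.conv(x)                    # a_{t+1} = W * x_t (same W at every step)
            s = x + torch.relu(a)               # s_{t+1} = x_t + sigma(a_{t+1})
            y = self.bns[t](s)                  # y_{t+1} = BN_t(s_{t+1})
            x = F.max_pool2d(y, 2) if t in self.downsample_at else y  # x_{t+1} = D_t(y_{t+1})
        return self.classifier(x.mean(dim=(2, 3)))  # global average pooling, then linear head
```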

2. Residual Connections and Shortcut History

Expressivity and trainability are further enhanced via shortcut histories. For history length $h$, learnable scalars $\alpha_{t,i}$ combine up to $h$ past activations at each step:

$$\begin{aligned}
\mathbf{b}_{t+1} &= \mathcal{D}_t(\mathbf{a}_{t+1}) + \sum_{i=0}^{\min(t,h)} \alpha_{t,i} \left(\mathcal{D}_t \circ \cdots \circ \mathcal{D}_{t-i}\right)(\mathbf{x}_{t-i}), \\
\mathbf{x}_{t+1} &= \mathrm{BN}_t(\mathbf{b}_{t+1})
\end{aligned}$$

This incurs only $hT$ additional scalar parameters. Ablations show that learning real-valued $\alpha$ is essential: hard-thresholding to binarized values degrades accuracy by 1–2 points.
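A minimal sketch of this shortcut-history mechanism (the buffer handling and the pooling operator used for $\mathcal{D}_t$ are assumptions, not the reference code) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortcutHistory(nn.Module):
    """Keeps up to h+1 recent activations and mixes them with learnable scalars alpha_{t,i}."""

    def __init__(self, T, h):
        super().__init__()
        self.h = h
        # One learnable scalar per (step, history index): on the order of h*T extra parameters.
        self.alpha = nn.Parameter(torch.zeros(T, h + 1))
        self.history = []  # x_t, x_{t-1}, ..., stored at the current spatial resolution

    def append(self, x):
        self.history = ([x] + self.history)[: self.h + 1]

    def downsample(self):
        # Whenever D_t shrinks the feature map, every stored x_{t-i} must shrink with it,
        # so the weighted sum defining b_{t+1} stays shape-consistent.
        self.history = [F.max_pool2d(z, 2) for z in self.history]

    def combine(self, b, t):
        # b_{t+1} = D_t(a_{t+1}) + sum_i alpha_{t,i} * (D_t o ... o D_{t-i})(x_{t-i})
        for i, z in enumerate(self.history):
            b = b + self.alpha[t, i] * z
        return b
```

In the main recursion, one would call `append` on each new $\mathbf{x}_t$, call `downsample` whenever $\mathcal{D}_t$ is active, and use `combine` to form $\mathbf{b}_{t+1}$ just before the step's BatchNorm.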

3. Evolution of Feature Geometry and Parameter Budget

Throughout, the channel count is kept constant at $f$; spatial resolution evolves as

$$(H_t, W_t) = (H, W) \Big/ \prod_{j < t} r_j$$

where downsampling ratios $r_j \in \{1, 2\}$ are scheduled as needed. Downsampling can be applied regularly or irregularly, allowing explicit control over the MACs (multiply-accumulate operations) and feature map size. The full parameter count (excluding grouped convolutions) is

$$f^2 a b + 2 f T + h T + f \times (\#\mathrm{classes})$$

Grouped/depthwise convolutions further reduce the dominant term to $f(ab + f)$.
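As a quick sanity check on this budget, the formula can be evaluated directly; the configuration below ($f = 64$, $3 \times 3$ kernels, $T = 15$, $h = 5$) is chosen for illustration and happens to land near the $\approx 40\,\text{K}$ regime discussed in Section 5.

```python
def thrifty_param_count(f, a, b, T, h, n_classes, grouped=False):
    """Approximate ThriftyNet parameter count from the formula above.

    conv:       f^2 * a * b   (or f * (a*b + f) with grouped/depthwise factorization)
    batchnorm:  2f affine parameters per recursion
    shortcuts:  about h scalars per recursion
    classifier: f weights per class (bias ignored)
    """
    conv = f * (a * b + f) if grouped else f * f * a * b
    return conv + 2 * f * T + h * T + f * n_classes

# Illustrative CIFAR-10-sized configuration: 64 filters, 3x3 kernel, 15 recursions, history 5.
print(thrifty_param_count(f=64, a=3, b=3, T=15, h=5, n_classes=10))                  # 39499 (~40K)
print(thrifty_param_count(f=64, a=3, b=3, T=15, h=5, n_classes=10, grouped=True))    # grouped variant
```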

4. Training and Inference Procedures

ThriftyNet models are trained with cross-entropy loss, SGD (learning rate $0.1$, momentum $0.9$), no explicit weight decay, and a step-down schedule at epochs $50$, $100$, and $150$ for a total of $200$ epochs. Data augmentation via random cropping and horizontal flipping is found to deliver most of the attainable gains; more aggressive augmentation (AutoAugment, mixup, cutmix, cutout) yields marginal or negative returns, as shown in the table below:

| Augmentation | Test Accuracy (%) |
|---|---|
| standard (flip + crop) | 90.64 |
| + AutoAugment | 91.00 |
| + cutout (size = 8) | 90.40 |
| + mixup | 88.09 |
| + cutmix | 88.47 |
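A hedged sketch of this training recipe in PyTorch follows; the crop padding and the learning-rate decay factor are assumptions (the source only specifies the step-down epochs), and model/dataloader construction is omitted.

```python
import torch
import torchvision.transforms as transforms

# Standard augmentation (flip + crop), which the table above shows already captures
# most of the attainable gain; a crop padding of 4 pixels is a common CIFAR choice, assumed here.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def build_optimizer_and_scheduler(model):
    # SGD with learning rate 0.1 and momentum 0.9, no explicit weight decay, 200 epochs total.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.0)
    # Step-down schedule at epochs 50, 100 and 150; the decay factor 0.1 is an assumption.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[50, 100, 150], gamma=0.1)
    return optimizer, scheduler
```

Training then proceeds with the cross-entropy loss (`torch.nn.CrossEntropyLoss`) for 200 epochs, calling `scheduler.step()` once per epoch.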

Inference involves sequentially running all $T$ recursions; resource/accuracy trade-offs can be managed by adjusting $T$ or the pooling schedule.

5. Empirical Performance and Ablation Analyses

Results on CIFAR-10/100 establish that ThriftyNet achieves parity with much larger networks at a fraction of their parameter budget. For CIFAR-10, with approximately $40\,\text{K}$ parameters:

  • ThriftyNet ($h=5$, $T=15$): $90.15 \pm 0.42\%$
  • Residual ThriftyNet ($h=5$, $T=30$): $91.08 \pm 0.33\%$
  • ResNet-20: $91.25\%$ (using $270\,\text{K}$ parameters)

For CIFAR-100:

  • Residual ThriftyNet ($h=5$, $T=40$): $74.37\%$ ($600\,\text{K}$ parameters)
  • DenseNet-BC ($L=100$, $K=12$): $77.73\%$ ($800\,\text{K}$ parameters)
  • Wide-ResNet-16-8: $69.11\%$ ($600\,\text{K}$ parameters)

Ablations show moderate sensitivity to $T$ in the $[15, 45]$ range (for a fixed budget), clear gains from additional downsamplings, and strong dependence of accuracy on the number of filters $f$. For example, increasing $f$ from $32$ ($3$k params) to $256$ ($80$k params) on CIFAR-10 boosts accuracy from $\sim 75\%$ to $\sim 91.6\%$.

| Iterations $T$ | ThriftyNet Acc. (%) | Residual Acc. (%) |
|---|---|---|
| 10 | 89.2 | 89.3 |
| 20 | 90.47 | 90.71 |
| 45 | 90.18 | 90.95 |

| Downsamplings $n$ | ThriftyNet Acc. (%) | Residual Acc. (%) |
|---|---|---|
| 1 | 89.10 | 90.62 |
| 4 | 90.93 | 91.69 |

6. Computational and Design Trade-offs

ThriftyNets allow fine-grained adjustment between parameter count, $T$, and computational cost (MACs) to target specific hardware profiles. Pooling can be front-loaded to halve computational cost at a tolerable accuracy loss (4–5 points). Grouped convolutions are available to further shrink the parameter budget. The principal trade-off is between parameter reuse ($T$), accuracy, and computational efficiency.
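To make the trade-off concrete, the cost of the shared (ungrouped) convolution can be estimated per recursion from the current resolution; the sketch below is a back-of-the-envelope estimate with an assumed kernel size and downsampling schedule, not a profiler.

```python
def estimate_conv_macs(f, kernel=3, H=32, W=32, T=15, downsample_at=(5, 10)):
    """Rough MAC count for the shared convolution: each recursion costs about
    f^2 * kernel^2 * H_t * W_t, so earlier downsampling shrinks the total."""
    macs, h, w = 0, H, W
    for t in range(T):
        macs += f * f * kernel * kernel * h * w
        if t in downsample_at:      # D_t halves each spatial dimension
            h, w = h // 2, w // 2
    return macs

# Front-loading the pooling cuts the total sharply, at the accuracy cost noted above.
print(estimate_conv_macs(f=128, downsample_at=(5, 10)))   # later pooling: more MACs
print(estimate_conv_macs(f=128, downsample_at=(0, 1)))    # front-loaded pooling: far fewer MACs
```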

ThriftyNet's maximal parameter factorization contrasts with deepening/widening used in standard CNNs and may be viewed as an instance of recurrent convolutional networks with weight sharing. The architecture is distinct from parameter-pruning or quantization approaches, as it operates directly through strict model weight reuse. The term "ThriftyNet" is also encountered in wireless networking literature—there, a "ThriftyNet" is defined as one whose infrastructure is dimensioned by the user-perceived goodput, not peak physical throughput, and exploits coding techniques such as TCP/NC to maintain user-level performance while minimizing physical plant and energy costs. The key principle in both contexts is resource efficiency: in neural networks, via maximal parameter reuse; in wireless networks, via protocols that close the gap between throughput and goodput, minimizing required infrastructural spend (Coiffier et al., 2020, Kim et al., 2012).
