ThriftyNets: Deep Learning with Shared Parameters

Updated 9 December 2025
  • ThriftyNets are deep learning architectures defined by maximal parameter sharing, where a single convolutional weight tensor is recursively applied to reduce model size.
  • They incorporate shortcut histories and adaptive downsampling to balance expressivity with computational efficiency and parameter budget.
  • Empirical results on benchmarks like CIFAR-10/100 show that ThriftyNets achieve competitive accuracy with significantly fewer parameters than traditional CNNs.

ThriftyNets denote a modern deep learning architecture class distinguished by maximal parameter sharing, achieving high classification accuracy with sharply reduced parameter counts. In the canonical ThriftyNet model, only a single convolutional layer weight tensor is defined, which is applied recursively across multiple processing steps, with additional mechanisms—nonlinearity, normalization, downsampling, and shortcut connections—ensuring sufficient expressivity. ThriftyNets depart from conventional practice, where network depth and width typically increase parameter count, instead leveraging recurrence and parameter factorization to attain competitive performance under severe parameter budget constraints (Coiffier et al., 2020). The term "ThriftyNet" also appears independently in the context of wireless system design, where it concerns resource-efficient networking by aligning physical throughput with application-level goodput, often through advanced coding and transport-layer design (Kim et al., 2012).

1. ThriftyNet Architecture and Parameter Factorization

The ThriftyNet architecture is defined by the recursive application of a single convolutional-layer weight tensor, $\mathbf{W} \in \mathbb{R}^{f \times f \times a \times b}$, across $T$ iterations. The input image, padded to match $f$ channels, is processed as

$$\mathbf{x}_0 = \mathrm{PAD}(\mathbf{x}) \in \mathbb{R}^{f \times H \times W}$$

and evolved via

$$\begin{aligned}
\mathbf{a}_{t+1} &= \mathbf{W} \star \mathbf{x}_t, \\
\mathbf{s}_{t+1} &= \mathbf{x}_t + \sigma(\mathbf{a}_{t+1}), \\
\mathbf{y}_{t+1} &= \mathrm{BN}_t(\mathbf{s}_{t+1}), \\
\mathbf{x}_{t+1} &= \mathcal{D}_t(\mathbf{y}_{t+1}),
\end{aligned}$$

with ReLU/tanh activations, step-dependent BatchNorm (with $2f$ affine parameters per step), and optional downsampling $\mathcal{D}_t$. Parameter factorization yields a convolutional-layer budget of only $f^2 a b$ (ungrouped), maintained constant across all processing steps.
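The recursion above is compact enough to sketch directly in PyTorch. The sketch below is illustrative rather than the authors' reference implementation: the filter count, the number of iterations, the downsampling positions, and the use of max pooling for $\mathcal{D}_t$ are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThriftyNetSketch(nn.Module):
    """Minimal ThriftyNet core: one shared convolution applied for T recursions."""

    def __init__(self, in_channels=3, f=128, T=15, kernel_size=3,
                 n_classes=10, downsample_at=(5, 10)):
        super().__init__()
        self.f, self.T, self.in_channels = f, T, in_channels
        # The single shared weight tensor W: the only convolutional parameters (f^2 * a * b).
        self.conv = nn.Conv2d(f, f, kernel_size, padding=kernel_size // 2, bias=False)
        # Step-dependent BatchNorm: 2f affine parameters per recursion.
        self.bns = nn.ModuleList([nn.BatchNorm2d(f) for _ in range(T)])
        self.downsample_at = set(downsample_at)  # assumed schedule for D_t
        # Final classifier, matching the f x (#classes) term of the parameter count.
        self.classifier = nn.Linear(f, n_classes, bias=False)

    def forward(self, x):
        # Zero-pad the input along the channel axis so that x_0 has f channels.
        x = F.pad(x, (0, 0, 0, 0, 0, self.f - self.in_channels))
        for t in range(self.T):
            a = self.conv(x)                    # a_{t+1} = W * x_t (same W at every step)
            s = x + torch.relu(a)               # s_{t+1} = x_t + sigma(a_{t+1})
            y = self.bns[t](s)                  # y_{t+1} = BN_t(s_{t+1})
            x = F.max_pool2d(y, 2) if t in self.downsample_at else y  # x_{t+1} = D_t(y_{t+1})
        return self.classifier(x.mean(dim=(2, 3)))  # global average pooling, then linear head
```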

2. Residual Connections and Shortcut History

Expressivity and trainability are further enhanced via shortcut histories. For history length $h$, learnable scalars $\alpha_{t,i}$ combine up to $h$ past activations at each step:

$$\begin{aligned}
\mathbf{b}_{t+1} &= \mathcal{D}_t(\mathbf{a}_{t+1}) + \sum_{i=0}^{\min(t,h)} \alpha_{t,i} \left(\mathcal{D}_t \circ \cdots \circ \mathcal{D}_{t-i}\right)(\mathbf{x}_{t-i}), \\
\mathbf{x}_{t+1} &= \mathrm{BN}_t(\mathbf{b}_{t+1})
\end{aligned}$$

This incurs only $hT$ additional scalar parameters. Ablations show that learning real-valued $\alpha$ is essential: hard-thresholding to binarized values degrades accuracy by 1–2 points.
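A minimal sketch of this shortcut-history mechanism (the buffer handling and the pooling operator used for $\mathcal{D}_t$ are assumptions, not the reference code) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortcutHistory(nn.Module):
    """Keeps up to h+1 recent activations and mixes them with learnable scalars alpha_{t,i}."""

    def __init__(self, T, h):
        super().__init__()
        self.h = h
        # One learnable scalar per (step, history index): on the order of h*T extra parameters.
        self.alpha = nn.Parameter(torch.zeros(T, h + 1))
        self.history = []  # x_t, x_{t-1}, ..., stored at the current spatial resolution

    def append(self, x):
        self.history = ([x] + self.history)[: self.h + 1]

    def downsample(self):
        # Whenever D_t shrinks the feature map, every stored x_{t-i} must shrink with it,
        # so the weighted sum defining b_{t+1} stays shape-consistent.
        self.history = [F.max_pool2d(z, 2) for z in self.history]

    def combine(self, b, t):
        # b_{t+1} = D_t(a_{t+1}) + sum_i alpha_{t,i} * (D_t o ... o D_{t-i})(x_{t-i})
        for i, z in enumerate(self.history):
            b = b + self.alpha[t, i] * z
        return b
```

In the main recursion, one would call `append` on each new $\mathbf{x}_t$, call `downsample` whenever $\mathcal{D}_t$ is active, and use `combine` to form $\mathbf{b}_{t+1}$ just before the step's BatchNorm.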

3. Evolution of Feature Geometry and Parameter Budget

Throughout, the channel count is kept constant at $f$; spatial resolution evolves as

$$(H_t, W_t) = (H, W) \Big/ \prod_{j < t} r_j$$

where downsampling ratios $r_j \in \{1, 2\}$ are scheduled as needed. Downsampling can be applied regularly or irregularly, allowing explicit control over the MACs (multiply-accumulate operations) and feature map size. The full parameter count (excluding grouped convolutions) is

$$f^2 a b + 2 f T + h T + f \times (\#\mathrm{classes})$$

Grouped/depthwise convolutions further reduce the dominant term to $f(ab + f)$.
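As a quick sanity check on this budget, the formula can be evaluated directly; the configuration below ($f = 64$, $3 \times 3$ kernels, $T = 15$, $h = 5$) is chosen for illustration and happens to land near the $\approx 40\,\text{K}$ regime discussed in Section 5.

```python
def thrifty_param_count(f, a, b, T, h, n_classes, grouped=False):
    """Approximate ThriftyNet parameter count from the formula above.

    conv:       f^2 * a * b   (or f * (a*b + f) with grouped/depthwise factorization)
    batchnorm:  2f affine parameters per recursion
    shortcuts:  about h scalars per recursion
    classifier: f weights per class (bias ignored)
    """
    conv = f * (a * b + f) if grouped else f * f * a * b
    return conv + 2 * f * T + h * T + f * n_classes

# Illustrative CIFAR-10-sized configuration: 64 filters, 3x3 kernel, 15 recursions, history 5.
print(thrifty_param_count(f=64, a=3, b=3, T=15, h=5, n_classes=10))                  # 39499 (~40K)
print(thrifty_param_count(f=64, a=3, b=3, T=15, h=5, n_classes=10, grouped=True))    # grouped variant
```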

4. Training and Inference Procedures

ThriftyNet models are trained with cross-entropy loss, SGD (learning rate $0.1$, momentum $0.9$), no explicit weight decay, and a step-down schedule at epochs $50$, $100$, and $150$ for a total of $200$ epochs. Data augmentation via random cropping and horizontal flipping is found to deliver most of the attainable gains; more aggressive augmentation (AutoAugment, mixup, cutmix, cutout) yields marginal or negative returns, as shown in the table below:

| Augmentation | Test Accuracy (%) |
|---|---|
| standard (flip + crop) | 90.64 |
| + AutoAugment | 91.00 |
| + cutout (size = 8) | 90.40 |
| + mixup | 88.09 |
| + cutmix | 88.47 |
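A hedged sketch of this training recipe in PyTorch follows; the crop padding and the learning-rate decay factor are assumptions (the source only specifies the step-down epochs), and model/dataloader construction is omitted.

```python
import torch
import torchvision.transforms as transforms

# Standard augmentation (flip + crop), which the table above shows already captures
# most of the attainable gain; a crop padding of 4 pixels is a common CIFAR choice, assumed here.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def build_optimizer_and_scheduler(model):
    # SGD with learning rate 0.1 and momentum 0.9, no explicit weight decay, 200 epochs total.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.0)
    # Step-down schedule at epochs 50, 100 and 150; the decay factor 0.1 is an assumption.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[50, 100, 150], gamma=0.1)
    return optimizer, scheduler
```

Training then proceeds with the cross-entropy loss (`torch.nn.CrossEntropyLoss`) for 200 epochs, calling `scheduler.step()` once per epoch.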

Inference involves sequentially running all $T$ recursions; resource/accuracy trade-offs can be managed by adjusting $T$ or the pooling schedule.

5. Empirical Performance and Ablation Analyses

Results on CIFAR-10/100 establish that ThriftyNet achieves parity with much larger networks at a fraction of their parameter budget. For CIFAR-10, with approximately $40\,\text{K}$ parameters:

  • ThriftyNet ($h=5$, $T=15$): $90.15 \pm 0.42\%$
  • Residual ThriftyNet ($h=5$, $T=30$): $91.08 \pm 0.33\%$
  • ResNet-20: $91.25\%$ (using $270\,\text{K}$ parameters)

For CIFAR-100:

  • Residual ThriftyNet ($h=5$, $T=40$): $74.37\%$ ($600\,\text{K}$ parameters)
  • DenseNet-BC ($L=100$, $K=12$): $77.73\%$ ($800\,\text{K}$ parameters)
  • Wide-ResNet-16-8: $69.11\%$ ($600\,\text{K}$ parameters)

Ablations show moderate sensitivity to $T$ in the $[15, 45]$ range (for a fixed budget), clear gains from additional downsamplings, and strong dependence of accuracy on the number of filters $f$. For example, increasing $f$ from $32$ ($3$k params) to $256$ ($80$k params) on CIFAR-10 boosts accuracy from $\sim 75\%$ to $\sim 91.6\%$.

| Iterations $T$ | ThriftyNet Acc. (%) | Residual Acc. (%) |
|---|---|---|
| 10 | 89.2 | 89.3 |
| 20 | 90.47 | 90.71 |
| 45 | 90.18 | 90.95 |

| Downsamplings $n$ | ThriftyNet Acc. (%) | Residual Acc. (%) |
|---|---|---|
| 1 | 89.10 | 90.62 |
| 4 | 90.93 | 91.69 |

6. Computational and Design Trade-offs

ThriftyNets allow fine-grained adjustment between parameter count, $T$, and computational cost (MACs) to target specific hardware profiles. Pooling can be front-loaded to halve computational cost at a tolerable accuracy loss (4–5 points). Grouped convolutions are available to further shrink the parameter budget. The principal trade-off is between parameter reuse ($T$), accuracy, and computational efficiency.
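To make the trade-off concrete, the cost of the shared (ungrouped) convolution can be estimated per recursion from the current resolution; the sketch below is a back-of-the-envelope estimate with an assumed kernel size and downsampling schedule, not a profiler.

```python
def estimate_conv_macs(f, kernel=3, H=32, W=32, T=15, downsample_at=(5, 10)):
    """Rough MAC count for the shared convolution: each recursion costs about
    f^2 * kernel^2 * H_t * W_t, so earlier downsampling shrinks the total."""
    macs, h, w = 0, H, W
    for t in range(T):
        macs += f * f * kernel * kernel * h * w
        if t in downsample_at:      # D_t halves each spatial dimension
            h, w = h // 2, w // 2
    return macs

# Front-loading the pooling cuts the total sharply, at the accuracy cost noted above.
print(estimate_conv_macs(f=128, downsample_at=(5, 10)))   # later pooling: more MACs
print(estimate_conv_macs(f=128, downsample_at=(0, 1)))    # front-loaded pooling: far fewer MACs
```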

ThriftyNet's maximal parameter factorization contrasts with deepening/widening used in standard CNNs and may be viewed as an instance of recurrent convolutional networks with weight sharing. The architecture is distinct from parameter-pruning or quantization approaches, as it operates directly through strict model weight reuse. The term "ThriftyNet" is also encountered in wireless networking literature—there, a "ThriftyNet" is defined as one whose infrastructure is dimensioned by the user-perceived goodput, not peak physical throughput, and exploits coding techniques such as TCP/NC to maintain user-level performance while minimizing physical plant and energy costs. The key principle in both contexts is resource efficiency: in neural networks, via maximal parameter reuse; in wireless networks, via protocols that close the gap between throughput and goodput, minimizing required infrastructural spend (Coiffier et al., 2020, Kim et al., 2012).
