
Minimal Learnable Parameters

Updated 1 February 2026
  • Minimal Learnable Parameters are the fewest trainable components needed to achieve stable and near-optimal performance in machine learning tasks.
  • Constructive approaches such as low-dimensional parameterization, replacement learning, and quantum circuits enable efficient model design.
  • Empirical analyses show that models operating at this minimal regime maintain robustness against aggressive pruning and quantization.

Minimal learnable parameters refer to the smallest number of trainable components (weights, biases, gates, or other optimized scalars/tensors) necessary for a machine learning system to stably solve a given learning task at or near its maximal achievable performance. This concept unifies several research trajectories in neural network compression, architecture search, function approximation theory, structured pruning, quantum circuits, and the design of parsimonious end-to-end feature extractors. Recent works formalize and empirically determine minimal parameter counts for various architectures and tasks, provide constructive methodologies for minimizing learnable parameters, and analyze the critical effects of pruning, quantization, and training dynamics near these minimal boundaries.

1. Formal Definitions and Quantitative Characterization

The minimal learnable parameter regime is best formalized by linking model capacity (parameter count) to the convergence and stability of learning. Zheng et al. (Zheng et al., 25 Jan 2026) define the minimal parameter threshold $P_{\min}$ as

$$P_{\min} = \min\bigl\{\,P :\; A(P') \ge A(P_{\max}) - \varepsilon_A,\;\ \forall P' \ge P\,\bigr\}, \quad \text{with } \sigma(P) \le \varepsilon_\sigma,$$

where $A(P)$ is the expected test accuracy at capacity $P$, and $\sigma(P)$ captures the repeatability (variance) of training outcomes. This threshold demarcates three regimes: (1) unstable/under-parameterized (high $\sigma$, low $A$); (2) stable learning; (3) overfit/over-parameterized.
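A minimal sketch of how $P_{\min}$ could be read off an empirical capacity sweep, assuming per-capacity mean accuracies and run-to-run standard deviations have already been measured (the function name and the toy numbers are illustrative, not from the cited paper):

```python
import numpy as np

def minimal_parameter_threshold(P, A, sigma, eps_A=0.01, eps_sigma=0.02):
    """Smallest capacity P such that every sweep point P' >= P reaches
    A(P_max) - eps_A and has run-to-run std-dev at most eps_sigma."""
    P, A, sigma = map(np.asarray, (P, A, sigma))  # sorted by P ascending
    target = A[-1] - eps_A                        # A(P_max) - eps_A
    for i in range(len(P)):
        if np.all(A[i:] >= target) and np.all(sigma[i:] <= eps_sigma):
            return float(P[i])
    return None  # no stable regime found within the sweep

# toy sweep: accuracy saturates and variance collapses near 2e4 parameters
caps = [5e3, 1e4, 2e4, 5e4, 1e5]
acc  = [0.80, 0.92, 0.975, 0.978, 0.979]
std  = [0.10, 0.05, 0.010, 0.008, 0.007]
print(minimal_parameter_threshold(caps, acc, std))  # -> 20000.0
```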

For specialized models, such as weighted automata, the minimal learnable parameters are given by the Hankel matrix rank $r$ and scale as $2r + |\Sigma| r^2$ for a function of minimal WA-rank $r$ over alphabet $\Sigma$ (Kaznatcheev et al., 2020). For function approximation, $n+2$ intrinsic neural parameters suffice to achieve accuracy exponential in $n$ for any Lipschitz continuous mapping $f$ over $[0,1]^d$ (Shen et al., 2021).

Empirical architecture sweeps consistently show that modest parameter counts suffice for near-maximum performance: for MNIST, Fashion-MNIST, and CIFAR-10, minimal $P$ values are on the order of $2\times 10^4$ (DNN), $3\times 10^4$ (CNN), and $5\times 10^3$–$6\times 10^3$ (ViT) for stable learning (Zheng et al., 25 Jan 2026).

2. Constructive Approaches to Minimal Parameterization

Several algorithmic and architectural paradigms enable explicit construction of models with minimal learnable parameters.

Low-dimensional intrinsic parameterizations:

Lu et al. prove that a ReLU network with merely $n+2$ intrinsic parameters (learned coefficients in a linear map plus a scale and a bias) suffices to approximate any Lipschitz function $f$ over $[0,1]^d$ with error $O(\lambda\sqrt{d}\,2^{-n})$, provided the remaining network is fixed or pre-trained (Shen et al., 2021). Auxiliary sub-networks encode and decode the signal, but only the intrinsic block is adapted per problem.
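To make the frozen-vs-trainable split concrete, here is a toy illustration (a sketch of the parameter accounting, not the construction of Shen et al.): a fixed random-feature network in which only $n$ read-out coefficients plus a scale and a bias, i.e. $n+2$ scalars, would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 8, 256, 16   # input dim, frozen hidden width, intrinsic dim

# frozen (never trained) random-feature encoder
W_frozen = rng.normal(size=(width, d))
b_frozen = rng.normal(size=width)
proj = rng.normal(size=(n, width)) / np.sqrt(width)  # frozen read-out basis

def features(x):
    """n fixed ReLU features of the input; nothing here is trained."""
    return proj @ np.maximum(W_frozen @ x + b_frozen, 0.0)

# the ONLY trainable parameters: n coefficients + scale + bias = n + 2
theta, scale, bias = np.zeros(n), 1.0, 0.0

def f(x):
    return scale * (theta @ features(x)) + bias

print(n + 2)  # trainable parameter count -> 18
```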

Structured parameter sharing and replacement learning:

Replacement Learning replaces a full parameter tensor $W_i$ in designated layers by a linear combination of its neighboring tensors,

$$W'_i = \alpha_i\,W_{i-1} + \beta_i\,W_{i+1},$$

where $\{\alpha_i, \beta_i\}$ are learned scalars, yielding two learnable parameters per frozen layer (Zhang et al., 2024). This drastically reduces the total trainable parameter count while leveraging the stored capacity of neighboring layers. Experimentally, this strategy not only reduces training time and hardware footprint but also consistently improves generalization across CNNs, ViTs, and multiple image recognition benchmarks.
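The replacement rule itself is a one-liner; the sketch below, with illustrative shapes and frozen neighbor tensors, shows how a $64\times 64$ weight matrix collapses to two trainable scalars:

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (64, 64)
W_prev = rng.normal(size=shape)   # frozen weights of layer i-1
W_next = rng.normal(size=shape)   # frozen weights of layer i+1

# the replaced layer keeps only two trainable scalars
alpha_i, beta_i = 0.5, 0.5

def replaced_weight(alpha, beta):
    """W'_i = alpha * W_{i-1} + beta * W_{i+1}."""
    return alpha * W_prev + beta * W_next

W_i = replaced_weight(alpha_i, beta_i)
# trainable parameters for this layer: 2 instead of 64 * 64 = 4096
print(W_i.shape, "trainable:", 2, "replaced:", shape[0] * shape[1])
```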

Quantum parameterization:

Quantum graph neural networks can encode an $N$-node graph using only $O((\log N)^2)$ quantum parameters, plus a minimal classical head ($\sim 10^3$–$5\times 10^3$ trainable weights), by substituting classical spectral filtering and pooling with log-qubit, polynomial-depth QFT circuits whose parameters are directly learned (Daskin, 8 Jul 2025). On graph benchmarks, this achieves accuracy on par with or superior to classical fully-parameterized models.

3. Architectural Learning and Structured Sparsification

Minimal learnable parameter regimes can also be reached by actively pruning redundant structure during training.

Differentiable architecture learning:

A tri-state ReLU with learned per-unit width gates $w_{ij}\in[0,1]$ and layer depth gates $d_i$ enables simultaneous optimization of network weights and architecture (Srinivas et al., 2015). Smooth regularizers push $w_{ij}$ and $d_i$ towards $\{0,1\}$, zeroing out disconnected neurons and linearizing (thus collapsing) superfluous layers. Empirical compression ratios are as high as 90–95% (MNIST) and 60–70% (ImageNet), with sub-1% accuracy drop.
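One plausible reading of the gating scheme (the exact gate placement in Srinivas et al. may differ): width gates multiply ReLU units, a layer gate interpolates between the nonlinear layer and an identity pass-through, and a smooth penalty $g(1-g)$ pushes every gate toward $\{0,1\}$:

```python
import numpy as np

def tri_state_layer(h, W, w_gate, d_gate):
    """Width gates w_gate in [0,1] scale (and can zero out) ReLU units;
    the layer gate d_gate interpolates between the nonlinear layer (d=1)
    and an identity pass-through (d=0), letting training collapse
    superfluous layers."""
    z = W @ h
    nonlinear = w_gate * np.maximum(z, 0.0)   # gated ReLU units
    return d_gate * nonlinear + (1.0 - d_gate) * h

def gate_regularizer(w_gate, d_gate, lam=1e-3):
    """Smooth penalty g*(1-g): zero exactly when every gate is 0 or 1."""
    g = np.append(w_gate, d_gate)
    return lam * np.sum(g * (1.0 - g))

h = np.array([1.0, -2.0, 3.0])
print(tri_state_layer(h, np.eye(3), np.ones(3), 0.0))   # d=0: pass-through
print(gate_regularizer(np.array([0.0, 1.0, 1.0]), 1.0))  # binary gates -> 0.0
```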

Group-sparse multi-task compression:

Channel-wise group-lasso ($\ell_1/\ell_2$) regularization removes redundant convolutional channels at inference, yielding up to 93% sparsity in shared backbones for multi-task learning on ResNet-50 (Upadhyay et al., 2023). The optimization combines a multi-task loss with a penalty summing $\sqrt{n_g}\,\|\theta_{b,g}\|_2$ over groups (channels). Pruned models, retaining only 20–30% of backbone parameters, consistently outperform dense baselines and gain up to 8% faster inference.
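The group-lasso penalty is straightforward to compute; the sketch below treats each output channel of a convolution as one group (shapes and the regularization strength are illustrative):

```python
import numpy as np

def group_lasso_penalty(conv_weight, lam=1e-4):
    """Sum over output channels g of sqrt(n_g) * ||theta_g||_2,
    with one group per channel and n_g the group size."""
    out_ch = conv_weight.shape[0]
    groups = conv_weight.reshape(out_ch, -1)
    n_g = groups.shape[1]
    return lam * np.sqrt(n_g) * np.linalg.norm(groups, axis=1).sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16, 3, 3))           # (out_ch, in_ch, kH, kW)
print(group_lasso_penalty(W) > 0.0)           # -> True
print(group_lasso_penalty(np.zeros_like(W)))  # pruned-away channels cost 0
```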

Learnable feature-frontends:

LEAF constructs audio feature-extraction frontends with as few as 448 learnable scalars (for $K=64$ filters) via a learnable Gabor filterbank, channel-wise smoothing, adaptive compression, and channel-wise normalization (Zeghidour et al., 2021). Despite orders of magnitude fewer parameters than alternatives (e.g., Wavegram: 300K), accuracy on audio tagging and event recognition is state-of-the-art.
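The 448-scalar budget can be sanity-checked with one plausible per-filter accounting; this breakdown (2 Gabor parameters, 1 pooling width, and 4 per-channel compression/normalization scalars) is an assumption, not the paper's stated tally:

```python
def leaf_param_count(K=64):
    # ASSUMED breakdown per filter/channel: 2 Gabor parameters
    # (center frequency, bandwidth), 1 Gaussian-lowpass pooling width,
    # 4 compression/normalization scalars
    return 2 * K + K + 4 * K

print(leaf_param_count(64))  # -> 448
```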

4. Activation Function Parameter Minimality

Activation nonlinearities with minimal learnable parameters can considerably enhance expressivity and training dynamics.

Tangma activation:

The Tangma function introduces only two extra trainable scalars per neuron, $\alpha$ (horizontal shift of $\tanh$) and $\gamma$ (linear skip coefficient), defined as

$$\text{Tangma}(x;\alpha,\gamma) = x\,\tanh(x+\alpha) + \gamma\,x.$$

This guarantees non-vanishing gradients and tunable nonlinearity at the per-neuron level. Empirical results confirm strong stability, faster convergence, and up to 0.7% accuracy improvements over ReLU, Swish, or GELU, with just 0.1–1% parameter overhead in typical layers (Golwala, 2 Jul 2025).
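The activation is simple to implement directly from the definition (default values for $\alpha$ and $\gamma$ are illustrative):

```python
import numpy as np

def tangma(x, alpha=0.0, gamma=0.1):
    """Tangma(x; alpha, gamma) = x * tanh(x + alpha) + gamma * x."""
    return x * np.tanh(x + alpha) + gamma * x

print(tangma(0.0))  # 0.0 at the origin when alpha = 0
# the gamma*x skip term keeps gradients from vanishing for large negative x
```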

5. Interaction with Pruning, Quantization, and Depth

The parameter regime at or near minimality is highly sensitive to further compression or reduced-precision constraints.

Pruning tolerance:

Once the minimal parameter threshold $P_{\min}$ is reached, DNNs can tolerate additional pruning of up to 60% without significant accuracy loss on simple tasks, but CNNs and ViTs are less resilient (safe ratios of 5–20%) on image classification (Zheng et al., 25 Jan 2026). Deeper networks exhibit greater redundancy, allowing more aggressive compression before performance degrades.
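Magnitude pruning at a given ratio can be sketched as follows (the global-magnitude thresholding rule is the generic variant, an assumption about the cited setup):

```python
import numpy as np

def magnitude_prune(W, ratio):
    """Zero out the `ratio` fraction of weights with smallest magnitude."""
    flat = np.abs(W).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return W.copy()
    thresh = np.partition(flat, k - 1)[k - 1]   # k-th smallest |w|
    return np.where(np.abs(W) <= thresh, 0.0, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))
sparsity = float(np.mean(magnitude_prune(W, 0.6) == 0.0))
print(round(sparsity, 2))  # -> 0.6
```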

Quantization effects:

At $P_{\min}$, quantization-aware training (8-bit) can incur accuracy gaps of up to 14.8% (CNN, CIFAR-10); the gap is lower (≤1%) for DNNs or on simpler tasks (MNIST, Fashion-MNIST). Increasing the parameter count to 2–5$\times$ above $P_{\min}$ is effective for regaining the margin lost to quantization.
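A minimal sketch of symmetric per-tensor 8-bit quantization, the generic scheme behind such studies (the exact quantizer used in the cited work is not specified here):

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor 8-bit quantization, round-to-nearest."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int8(W)
err = np.abs(W - q.astype(np.float32) * s).max()
print(bool(err <= 0.5 * s + 1e-6))  # round-to-nearest error <= half a step
```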

6. Task-specific and Algorithmic Minimality in Non-NN Models

The principle of minimal learnable parameters extends beyond classical neural networks.

Weighted automata:

A function $f:\Sigma^*\to\mathbb{F}$ is minimally parameterized by a weighted automaton with $r=\operatorname{rank}_{\mathbb{F}}(H_f)$ states, where $H_f$ is the Hankel matrix of $f$, resulting in $2r + |\Sigma| r^2$ learnable scalars. The minimal automaton is also actively identifiable in polynomial query and runtime complexity via the generalized Angluin–Schapire algorithm, regardless of the potentially exponential size of equivalent nondeterministic representations (Kaznatcheev et al., 2020).
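The parameter count follows directly from the automaton's shape: two rank-$r$ vectors plus one $r\times r$ transition matrix per alphabet symbol:

```python
def wa_parameter_count(r, alphabet_size):
    """2r + |Sigma| r^2: initial vector (r) + final vector (r)
    + one r x r transition matrix per alphabet symbol."""
    return 2 * r + alphabet_size * r * r

print(wa_parameter_count(3, 2))  # 6 + 18 -> 24
```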

Quantum-circuit learning:

Hybrid classical–quantum GNNs with only minimal parameterized QFT elements and shallow prediction heads can match or exceed fully classical models on multiple graph classification datasets, achieving parameter counts orders-of-magnitude smaller than dense neural GNNs (Daskin, 8 Jul 2025).

7. Practical Guidelines and Performance Benchmarks

The unified empirical evidence supports several actionable strategies for practitioners:

  • Always target a parameter count $P \ge P_{\min}$, as determined by an architecture sweep (see the summary table below).
  • After reaching stable learning, structured pruning can safely remove between 20–60% additional parameters, depending on architecture depth and task complexity (Zheng et al., 25 Jan 2026, Upadhyay et al., 2023).
  • For quantization, models at $P_{\min}$ are the most sensitive to accuracy drop; increase the parameter count if high precision cannot be retained (Zheng et al., 25 Jan 2026).
  • Replacement learning or learnable gating strategies are highly effective for reducing effective parameter count while maintaining, or even improving, generalization (Zhang et al., 2024, Srinivas et al., 2015).
  • Architectural and regularization choices enabling group-wise or layer-wise parameter adaptation provide the most efficient route to task-optimal minimality.

Summary Table: Minimal Parameter Thresholds and Safe Pruning Quantiles

| Model/Task | Minimal Parameters ($P_{\min}$) | Safe Pruning Ratio |
|---|---|---|
| DNN (MNIST) | $2\times 10^4$ | 60% |
| CNN (MNIST) | $3\times 10^4$ | 20% |
| ViT (MNIST) | $5\times 10^3$ | 20% |
| DNN (CIFAR-10) | $2\times 10^6$ | 40% |
| CNN (CIFAR-10) | $3\times 10^6$ | 5% |
| ViT (CIFAR-10) | $6\times 10^3$ | 10% |

These quantitative results are representative and extracted directly from empirical convergence analyses (Zheng et al., 25 Jan 2026). Pruned/sparse and replacement-based models, along with minimal-parameter activation and frontend designs (Tangma- and LEAF-based systems), reinforce the tightness of minimality: substantial compression and even parameter sharing can be achieved without meaningful performance loss when guided by principled methodologies (Golwala, 2 Jul 2025, Zeghidour et al., 2021, Zhang et al., 2024, Upadhyay et al., 2023).

In summary, the investigation of minimal learnable parameters provides a unifying lens for understanding capacity, efficiency, and robustness in modern machine learning. Through precise definitions, constructive design, and systematic empirical validation, this area establishes both theoretical and algorithmic foundations for developing compact, high-performing models across diverse domains.
