
Curse of Simple Size

Updated 23 January 2026
  • Curse of simple size is a phenomenon describing inherent statistical and algorithmic limitations due to insufficient sample size, leading to slow convergence and inflated bias.
  • It imposes strict sample complexity bounds in nonparametric inference, contrastive learning, and topological data analysis, often requiring exponentially more data in high-dimensional settings.
  • Mitigation strategies include exploiting problem structure, applying dimensionality reduction, and using bias corrections and batch size adaptations to improve generalization.

The "curse of simple size" describes fundamental statistical and algorithmic barriers arising from limited sample size (or small batch size, or insufficient scaling) across a range of estimation, learning, and inference settings. It manifests as slow nonparametric convergence rates, severe generalization gaps, inflated selection biases, saturation of information-theoretic bounds, and instability in nonparametric algorithms, including contrastive learning, set cardinality estimation, persistence diagram analysis, and statistical forecasting. The following sections synthesize the core mechanisms, theoretical results, and mitigation strategies underlying the curse of simple size.

1. Fundamental Limits: Sample Size, Structure, and Minimax Rates

Small sample size is a primary determinant of attainable risk in nonparametric inference, especially in high-dimensional and "low-structure" settings.

  • In nonparametric $L^2$-goodness-of-fit testing on $[0,1]^d$, the minimax detectable discrepancy $\epsilon^*$ between two densities $f$ and $g$ (both in the Hölder class $H^d_s(L)$) satisfies

$$\epsilon^*(n,d,s) \asymp n^{-2s/(4s+d)}$$

so that to reliably detect a fixed discrepancy $\epsilon$, one must have

$$n \gtrsim \epsilon^{-(4s+d)/(2s)}$$

which, at fixed smoothness $s$, grows exponentially in $d$: this is the archetypal "curse of dimensionality" as analyzed in the minimax sense (Arias-Castro et al., 2016).

  • When estimating the size $N = |S|$ of a finite set $S$ via uniform i.i.d. sampling, the required $n$ to achieve consistent estimation depends fundamentally on the structural complexity of $S$:
    • If $S$ is arbitrary, $n = O(N^{1/2})$ is necessary (the "birthday regime"), due to the time to first collision.
    • If $S = \{1, \dots, N\}$ has a total order, any $n \to \infty$ suffices (the "German tank problem").
    • More generally, $n$ scales with geometric/partial-order parameters (width of a poset, dimension of a convex body, etc.) (Chatterjee et al., 7 Aug 2025).
  • In high-dimensional function approximation, standard tensor-product methods require $O(n^d)$ samples, but in certain structured function classes (e.g., Kolmogorov–Lipschitz) this curse can be broken: only $O(nd)$ samples and parameters yield nearly optimal $O(n^{-1})$ $L^\infty$ error rates, via KST representations, careful spline bases, and matrix cross-approximation (Lai et al., 2021).
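The contrast between the ordered and unordered cases above can be illustrated with a small simulation (an informal sketch, not from the cited papers; sampling is with replacement, which is negligible at this scale). With a total order, the German tank estimator recovers $N$ from a handful of draws; without structure, a collision-based approach must wait on the order of $\sqrt{N}$ draws for its first repeat.

```python
import random

random.seed(0)
N = 10_000  # true set size (unknown to the estimators)

# Ordered set: the German tank estimator max * (n+1)/n - 1 is accurate
# from a handful of draws because the total order is exploitable structure.
n_small = 20
draws = [random.randint(1, N) for _ in range(n_small)]
tank_estimate = max(draws) * (n_small + 1) / n_small - 1

# Arbitrary set: with no structure, a collision-based estimate needs
# on the order of sqrt(N) draws before the first repeat even appears.
seen, n_to_collision = set(), 0
while True:
    x = random.randint(1, N)
    n_to_collision += 1
    if x in seen:
        break
    seen.add(x)

print(f"German tank estimate from n={n_small}: {tank_estimate:.0f}")
print(f"Draws until first collision: {n_to_collision} (sqrt(N) = {N**0.5:.0f})")
```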

These facts illustrate that, absent favorable structure, attainable estimation accuracy is throttled by the available sample size and the intrinsic complexity of the problem instance.
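The exponential sample demand in the minimax testing bound above is easy to make concrete: plugging illustrative values of $\epsilon$ and $s$ (chosen here for demonstration only) into $n \gtrsim \epsilon^{-(4s+d)/(2s)}$ shows the required $n$ exploding with $d$.

```python
# Required sample size n ~ eps^{-(4s+d)/(2s)} for detecting a discrepancy
# eps in the minimax goodness-of-fit bound, at fixed smoothness s.
eps, s = 0.1, 1.0  # illustrative values, not tied to any particular study

for d in (1, 5, 10, 20):
    n_required = eps ** (-(4 * s + d) / (2 * s))
    print(f"d={d:2d}: n of order {n_required:.3e}")
```

At these values the requirement grows from a few hundred samples at $d=1$ to around $10^{12}$ at $d=20$.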

2. Statistical Generalization, Overfitting, and Model Complexity

In predictive statistical modeling, especially when sample size $N$ is small relative to effective model complexity $h$ (e.g., VC dimension, number of free parameters), generalization error exhibits a slow $O(1/\sqrt{N})$ decay and overfitting becomes prevalent.

$$|E_{\text{gen}} - E_{\text{emp}}| \leq O\left(\sqrt{\frac{h(\ln(N/h)+1) - \ln\delta}{N}}\right)$$

implies that only very short-horizon predictions, or aggressive feature selection/regularization, yield reliable forecasts when $N$ is limited (Nakıp et al., 2020).
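Evaluating this bound numerically (with constant factors omitted, so the figures are illustrative only) shows how slowly it tightens with $N$ and how a complex model on a small sample gives a vacuous guarantee:

```python
import math

def vc_generalization_gap(N, h, delta=0.05):
    """Upper bound on |E_gen - E_emp| from the VC inequality
    (constant factors omitted; illustrative only)."""
    return math.sqrt((h * (math.log(N / h) + 1) - math.log(delta)) / N)

# The bound only tightens like 1/sqrt(N): even at N = 10,000 a model
# with effective complexity h = 50 still has a sizeable guaranteed gap.
for N in (100, 1_000, 10_000):
    print(f"N={N:6d}: gap <= {vc_generalization_gap(N, h=50):.3f}")
```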

  • In real-world applications, as demonstrated in COVID-19 case forecasting, even carefully tuned ML models (linear regression, MLP, LSTM) fail to generalize on medium- or long-horizon prediction tasks due to the curse of small $N$. Only extremely short-range predictions (e.g., 3-day horizons) remain accurate, and increasing model complexity amplifies overfitting risk rather than mitigating it (Nakıp et al., 2020).
  • Feature selection (correlation filtering, recursive selection, lasso) can partially alleviate the curse, but the barrier remains when $N$ is fundamentally too small relative to input size or model complexity.

Thus, reliable generalization depends not only on algorithmic sophistication but on the fundamental scaling of sample size with respect to the problem's complexity.

3. Selection Effects, Winner's Curse, and Post-Selection Bias

Conditioning on rare or "significant" events in small sample settings (e.g., reporting only statistically significant effects) introduces systematic overestimation and undercoverage—phenomena encapsulated by the "winner's curse".

  • Suppose an estimate $b \sim N(\beta, \mathrm{se}^2)$ is reported only if $|b|/\mathrm{se} > c$ for some threshold $c$ (the "significance filter"). Then the conditional expectation

$$\mathbb{E}\left[\,|b| \mid |b|/\mathrm{se} > c\,\right] = |\beta| + \mathrm{se} \cdot g(|\beta|/\mathrm{se}, c)$$

always exceeds $|\beta|$; the relative bias (exaggeration) is a strictly decreasing function of power, and substantial at low power (exaggeration factors can exceed 1.5 at 20% power) (Zwet et al., 2020).

  • The usual (unconditional) confidence interval's nominal coverage $1-\alpha$ fails post-selection; if power is $\le 50\%$, conditional coverage drops below $1-\alpha$.
  • Shrinkage corrections are effective: Bayesian normal–normal estimators and frequentist conditional-likelihood corrections reduce or eliminate the winner's curse, especially as $c \to \infty$ (Zwet et al., 2020).

Failure to correct for selection-induced inflation can lead to misleading inferences in studies with small or moderate sample sizes, particularly when low power is prevalent.
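A short Monte Carlo sketch (illustrative values, not taken from the cited paper) makes the mechanism visible: at $\beta/\mathrm{se} = 1$ the two-sided power at $c = 1.96$ is only about 17%, and conditioning on significance inflates the reported effect well above the truth.

```python
import random

random.seed(1)
beta, se, c = 1.0, 1.0, 1.96  # true effect, standard error, significance cutoff

# Simulate many studies; "publish" |b| only when it clears the filter.
selected = []
for _ in range(200_000):
    b = random.gauss(beta, se)
    if abs(b) / se > c:        # significance filter
        selected.append(abs(b))

exaggeration = (sum(selected) / len(selected)) / abs(beta)
print(f"E[|b| | significant] / |beta| = {exaggeration:.2f}")
```

Every selected estimate necessarily satisfies $|b| > c \cdot \mathrm{se} = 1.96$, so the conditional mean must exceed the true $|\beta| = 1$ by a wide margin.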

4. Algorithmic Bottlenecks: Saturation in Contrastive Learning and Statistical Re-weighting

Specific algorithms—especially those relying on classification over negatives or statistical re-weighting—exhibit structural bottlenecks as sample size or batch size decreases.

Contrastive Learning:

  • The InfoNCE loss, prevalent in self-supervised contrastive learning (e.g., SimCLR, MoCo), upper-bounds its mutual information estimate at $\log K$, where $K$ is the batch (negative) size.
  • For small $K$, two simultaneous bottlenecks emerge:
    • The loss saturates at $\log K$, so the learning signal vanishes.
    • Gradients collapse to zero due to softmax numerics ("vanishing gradient pathology"), especially in low-precision computation (Chen et al., 2021).
  • This "$\log K$ curse" necessitates very large batches for effective contrastive training.
  • The FlatNCE objective, as a dual formulation, removes the $\log K$ ceiling entirely by generating non-vanishing, importance-weighted gradients for any $K$. Empirically, FlatNCE enables high performance at much smaller $K$ with minimal code changes (Chen et al., 2021).
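The $\log K$ ceiling can be demonstrated directly with a minimal NumPy sketch (an informal illustration, not the authors' implementation): even an arbitrarily strong critic that scores every positive far above the negatives cannot push the InfoNCE mutual-information estimate past $\log K$.

```python
import numpy as np

rng = np.random.default_rng(0)

def infonce_mi_estimate(scores):
    """InfoNCE mutual-information lower bound: log K minus the softmax
    cross-entropy of picking the positive (diagonal) among K candidates."""
    K = scores.shape[0]
    logits = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return np.log(K) + np.diag(log_softmax).mean()

# Even with a near-perfect critic (huge margin on the positives),
# the estimate saturates at log K -- the "log K curse".
for K in (8, 64, 512):
    scores = rng.normal(size=(K, K))
    scores[np.arange(K), np.arange(K)] += 100.0  # near-perfect critic
    print(f"K={K:4d}: MI estimate {infonce_mi_estimate(scores):.3f},"
          f" ceiling log K = {np.log(K):.3f}")
```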

Re-weighted Sampling:

  • In importance sampling (re-weighting) with increasing system size (e.g., high-order path integral molecular dynamics), variance of the estimator scales as

$$\mathrm{Var}(\bar{A}_n) \sim \frac{1}{n} \exp(cN)$$

so the sample size needed for fixed statistical error grows exponentially with the number of degrees of freedom $N$: the "curse of system size" (Ceriotti et al., 2011).

  • Once the variance of the "difference Hamiltonian" exceeds unity, both statistical uncertainty and estimator bias become unmanageable; only modest system sizes remain tractable under re-weighting.
  • Mitigation strategies include increasing the number of replicas (with computational tradeoffs), sampling directly under the true (often intractable) target measure, or switching to alternative acceleration methods such as PI-GLE thermostats (Ceriotti et al., 2011).
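A toy Gaussian re-weighting experiment (an illustrative sketch with made-up parameters, not a path-integral calculation) reproduces the scaling: the log-weight is a sum over $N$ degrees of freedom, so the weight variance grows like $\exp(cN)$ and the effective sample size collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000   # samples drawn from the "wrong" distribution
mu = 0.5      # per-coordinate mean shift between sampled and target measures

def effective_sample_size(N):
    """Re-weight N(0, I_N) samples toward N(mu*1, I_N); return the
    effective sample size of the importance weights."""
    x = rng.normal(size=(n, N))
    log_w = (mu * x - 0.5 * mu**2).sum(axis=1)  # log density ratio over N dof
    w = np.exp(log_w - log_w.max())             # stabilized weights
    return w.sum() ** 2 / (w ** 2).sum()

# Weight variance grows like exp(c N), so the usable fraction of the
# n samples collapses exponentially with the number of degrees of freedom.
for N in (1, 10, 30, 60):
    print(f"N={N:2d}: effective sample size {effective_sample_size(N):.0f} of {n}")
```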

5. Failures in Topological Data Analysis: High-Dimension, Low-Sample Regimes

In topological data analysis (TDA), the reliability of persistent homology and persistence diagrams deteriorates in high-dimension, low-sample-size (HDLSS) regimes.

  • As $d \to \infty$ with $n$ fixed, for i.i.d. Gaussian-perturbed point clouds, the bottleneck and Hausdorff distances between the true and observed persistence diagrams diverge, or remain bounded but uninformative. The observed diagrams are dominated by high-dimensional noise, and topological features vanish (Hiraoka et al., 2024).
  • This is termed the "curse of dimensionality on persistence diagrams": observed descriptors fail to faithfully represent the underlying topology in HDLSS settings.
  • Dimensionality reduction using normalized principal component analysis (PCA) (with appropriate score normalization) can partially mitigate the curse for Rips filtrations by compressing to the intrinsic dimension, preserving feature stability to $O(1)$ as $d$ increases. However, exact recovery remains unattainable, and dimension sensitivity persists for Čech filtrations (Hiraoka et al., 2024).

These results caution against naive TDA without dimension reduction in HDLSS settings, and suggest robust workflows leveraging normalized PCA for reliable topological inference.
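The noise-domination mechanism can be seen without any TDA library (a rough sketch under simplified assumptions; real persistence computations are more involved): the norm of a $d$-dimensional Gaussian perturbation concentrates near $\sigma\sqrt{d}$, so at fixed $n$ every observed point drifts arbitrarily far from the true structure as $d$ grows, swamping feature scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 20, 1.0  # fixed sample size, per-coordinate noise level

# Embed a fixed planar point cloud in R^d and add isotropic Gaussian noise.
# Each point's displacement concentrates near sigma * sqrt(d), so the
# Hausdorff distance between clean and noisy clouds grows with d.
for d in (2, 100, 10_000):
    clean = np.zeros((n, d))
    clean[:, :2] = rng.normal(size=(n, 2))  # true structure lives in 2D
    noisy = clean + sigma * rng.normal(size=(n, d))
    displacement = np.linalg.norm(noisy - clean, axis=1)
    print(f"d={d:6d}: mean displacement {displacement.mean():.1f}"
          f" (sigma*sqrt(d) = {sigma * np.sqrt(d):.1f})")
```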

6. Scaling Anomalies in Deep Learning: U-Shaped and Inverse Scaling

In LLMs and foundation models, the curse of (simple) size can manifest as "inverse scaling" or U-shaped scaling: increasing model or dataset size initially degrades performance, which only recovers, or even improves, beyond some scale.

  • Empirical studies of the Inverse Scaling Prize tasks found, for many benchmarks, that as model parameter count increases:
    • Performance first declines (inverse scaling: larger models worse).
    • After a regime transition ("knee"), performance recovers (U-shaped scaling) (Wei et al., 2022).
  • This phenomenon is often due to "distractor tasks"—spurious heuristics or shortcuts available to mid-sized models, but which can be suppressed by the largest models.
  • Simple mitigation (1-shot learning, chain-of-thought prompting) can abrogate inverse scaling on several tasks, converting inverse or U-shaped curves to monotonically improving ones at scale (Wei et al., 2022).

The implication is that monotonic improvement with increased model or batch size cannot be presumed; robust evaluations must consider these scaling pathologies and account for adverse behaviors at intermediate scales.

7. Mitigation and Recommendations

Addressing the curse of simple size requires both structural and algorithmic interventions, tailored to each scenario:

  • Exploit structure: Any additional order, geometry, or smoothness (low intrinsic dimension, manifold structure, known partial order) can exponentially reduce sample size demands, as formalized in non-asymptotic minimax bounds (Chatterjee et al., 7 Aug 2025, Lai et al., 2021).
  • Dimension reduction: In HDLSS regimes, utilize PCA or variants to project to low-intrinsic-dimension representations before statistical or topological inference (Hiraoka et al., 2024).
  • Statistical corrections: Apply shrinkage (Bayesian or frequentist), bias adjustments, and post-selection intervals to counter winner's curse and inflated effect sizes (Zwet et al., 2020).
  • Batch size adaptation: For contrastive learners, switch to objectives like FlatNCE to sidestep the log-batch-size bottleneck, and monitor gradient diversity with diagnostics such as effective sample size (Chen et al., 2021).
  • Algorithmic diversification: Avoid re-weighted sampling on large systems; consider direct sampling or alternative stochastic algorithms for efficient high-dimensional integration (Ceriotti et al., 2011).
  • Study design: Ensure sufficient unconditional power in hypothesis testing to minimize bias inflation and undercoverage; adjust effect-size estimation in pilot studies and meta-analyses.

In summary, the curse of simple size is a latent constraint permeating estimation, inference, and learning, rooted in the interplay between sample complexity, intrinsic problem structure, and algorithmic limitations. While structural regularity or carefully chosen estimators can break or elude the curse in specific function spaces or settings, the general paradigm is one of sharply adverse scaling unless countered by explicit problem structure or methodological innovation.
