Sample Complexity
- Sample complexity is the minimal number of data points required to achieve a specified accuracy in estimation and learning tasks, governed by factors like dimensionality, noise, and model structure.
- It plays a crucial role across diverse fields—including PAC learning, stochastic optimization, and quantum information—informing both theoretical bounds and practical algorithm designs.
- Recent advances employ methods such as information theory, duality techniques, and replica approaches to refine bounds and clarify the statistical-computational trade-offs in complex settings.
Sample complexity quantifies the minimal number of data points, measurements, or trials necessary to achieve a prescribed accuracy in statistical estimation, learning, or control, with high confidence. It is central in statistical learning theory, empirical risk minimization, stochastic optimization, system identification, quantum information theory, and a wide range of applied domains. The optimal scaling of sample complexity is determined by problem structure—such as dimensionality, noise model, hypothesis class complexity, information-theoretic lower bounds, and (where relevant) privacy or computational constraints.
1. Core Definitions and General Principles
The sample complexity of a learning or estimation task is the smallest number of samples $n$ such that, with probability at least $1-\delta$, an estimator trained on $n$ i.i.d. samples outputs a solution achieving a target accuracy $\epsilon$ (e.g., in mean-square error, probability of misclassification, excess risk, or parameter distance). For example, in PAC learning, the canonical form is $m(\epsilon,\delta) = \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$ for concept classes of VC dimension $d$ in the realizable regime (Hanneke, 2015). In estimation and control problems, sample complexity typically refers to the number of sample points or time steps needed for a learning or identification algorithm to reach accuracy and confidence targets in the PAC or minimax sense (Jedra et al., 2019).
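As a concrete instance, the realizable PAC bound above can be turned into a simple sample-size calculator. This is a sketch: the leading constant `c` is illustrative, not the exact constant from (Hanneke, 2015).

```python
import math

def pac_sample_size(vc_dim: int, epsilon: float, delta: float, c: float = 1.0) -> int:
    """Sample size n = c * (d + log(1/delta)) / epsilon for realizable PAC
    learning of a class with VC dimension d (constant c is illustrative)."""
    return math.ceil(c * (vc_dim + math.log(1.0 / delta)) / epsilon)

# Halving the target error roughly doubles the required sample size.
n1 = pac_sample_size(vc_dim=10, epsilon=0.1, delta=0.05)
n2 = pac_sample_size(vc_dim=10, epsilon=0.05, delta=0.05)
```

Note the linear dependence on $1/\epsilon$, characteristic of the realizable (noise-free) regime.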
The central phenomena governing sample complexity include:
- The combinatorial or metric complexity of the function class (e.g., VC dimension, covering numbers, Rademacher complexity, representation dimension).
- Information-theoretic constraints such as Fisher information, minimax risk, and distinguishability under noise or corruption.
- Task structure, e.g., i.i.d. versus dependent samples, convexity, sparsity, privacy or robustness requirements.
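To make the first bullet concrete, the combinatorial complexity of a class can sometimes be checked by brute force. The following sketch verifies that one-dimensional threshold classifiers shatter any single point but no pair of distinct points, i.e. their VC dimension is 1 (the helper names are illustrative).

```python
def threshold_labels(points, t):
    """Label x as 1 iff x >= t (one-dimensional threshold classifier)."""
    return tuple(1 if x >= t else 0 for x in points)

def shattered(points) -> bool:
    """Brute-force check: can thresholds realize all 2^|points| labelings?
    One candidate threshold per 'gap' (below all points, between
    consecutive points, above all points) suffices."""
    xs = sorted(points)
    candidates = [xs[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    candidates += [xs[-1] + 1.0]
    realized = {threshold_labels(points, t) for t in candidates}
    return len(realized) == 2 ** len(points)
```

Thresholds can never realize the labeling (1, 0) on an ordered pair, which is exactly why two points are not shatterable.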
2. Sample Complexity in Supervised and PAC Learning
In binary PAC learning with a concept class of VC dimension $d$, the sample complexity achieves the classic rate
$$m(\epsilon,\delta) = \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$$
for realizable classification (Hanneke, 2015). This rate is optimal up to constant factors. For general loss minimization with bounded loss and a finite hypothesis class $\mathcal{H}$, standard bounds yield
$$n = O\!\left(\frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon^2}\right)$$
for uniform deviation control (via Hoeffding or Massart's finite class lemma).
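The finite-class bound can likewise be computed directly. This sketch uses the standard two-sided Hoeffding-plus-union-bound form $n \ge \log(2|\mathcal{H}|/\delta)/(2\epsilon^2)$ for losses in $[0,1]$.

```python
import math

def finite_class_sample_size(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """n >= log(2|H|/delta) / (2 eps^2) guarantees, via Hoeffding + a union
    bound, that every hypothesis's empirical risk is within eps of its mean
    with probability at least 1 - delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))

n = finite_class_sample_size(num_hypotheses=1000, epsilon=0.05, delta=0.01)
```

The logarithmic dependence on $|\mathcal{H}|$ is what makes even very large finite classes learnable with modest sample sizes.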
Extensions and variants include:
- Split-sample growth rate: Instead of worst-case combinatorial measures, one can control sample complexity via the maximum number of distinct hypotheses an ERM can output on subsamples of the data. For example, in auction design, if ERMs only select among observed values, this split-sample growth rate is small and yields sharper generalization bounds (Syrgkanis, 2017).
- Distribution-dependent and improper learning: For differentially-private learning, the representation dimension (RepDim) rather than VC-dimension governs sample complexity, potentially increasing the required number of examples by a logarithmic or worse factor (Beimel et al., 2014).
3. Sample Complexity in Convex and Stochastic Optimization
In stochastic convex optimization (SCO), recent work resolved the fundamental open question of the sample complexity of empirical risk minimization. For losses that are convex, Lipschitz, and bounded over the unit ball in $\mathbb{R}^d$, the optimal rate for all ERMs is
$$n = \tilde{\Theta}\!\left(\frac{d}{\epsilon} + \frac{1}{\epsilon^2}\right)$$
(Carmon et al., 2023). This is strictly better than uniform convergence, which requires $\Theta(d/\epsilon^2)$ samples. Proofs rely on covering-number arguments, Bregman divergences, and separating the complexity of the parameter space from that of the linear functional class (Rademacher complexity for linear functions).
The rate $d/\epsilon + 1/\epsilon^2$ arises from combining (i) the cost of covering the parameter space to accuracy $\epsilon$ and (ii) the control of variance due to stochasticity. The structure generalizes to symmetric convex bodies, where the linear-function Rademacher complexity may further refine the rate.
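The gap between the ERM rate and the uniform-convergence rate is easy to see numerically. This sketch drops logarithmic factors and constants, so the values are only order-of-magnitude comparisons.

```python
def erm_rate(d: int, eps: float) -> float:
    """ERM sample complexity in SCO, up to log factors: d/eps + 1/eps^2."""
    return d / eps + 1.0 / eps ** 2

def uniform_convergence_rate(d: int, eps: float) -> float:
    """Uniform-convergence requirement: d/eps^2."""
    return d / eps ** 2

# In high dimension with moderate accuracy, ERM needs far fewer samples:
# roughly 1e5 versus 1e6 at d = 10,000 and eps = 0.1.
d, eps = 10_000, 0.1
```

At fixed dimension the two rates cross over only when $\epsilon$ becomes very small, which is where the $1/\epsilon^2$ variance term dominates both.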
4. Information-Theoretic, Statistical and Algorithmic Minima
Sample complexity lower bounds are often established via a minimax risk argument, information divergences, or a primal-dual variational characterization. For example, in estimation with sub-Gaussian noise or in system identification, the minimax lower bound takes the form
$$n = \Omega\!\left(\frac{\sigma^2 \log(1/\delta)}{\epsilon^2}\right),$$
reflecting the information required to confidently distinguish two nearly indistinguishable models (Jedra et al., 2019). In parametric models, Fisher information provides the exact scaling: for maximum-likelihood estimation of a parameter $\theta$ under differentiable families,
$$n = \Theta\!\left(\frac{\operatorname{Tr}\bigl(I(\theta)^{-1}\bigr)}{\epsilon^2}\right),$$
where $I(\theta)$ is the Fisher information matrix. This governs both upper and lower bounds in classical and quantum settings (Kwon et al., 25 Feb 2026).
Algorithmic barriers may arise from statistical correlations, noise, or the absence of entanglement/memory (as in quantum tomography, where sample complexity jumps from polynomial to exponential in the number of qubits according to the measurement architecture (Kwon et al., 25 Feb 2026)).
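For a concrete parametric example, the Bernoulli($p$) family has Fisher information $I(p) = 1/(p(1-p))$, so the $n \approx I(p)^{-1}/\epsilon^2$ scaling can be evaluated directly. This is a first-order asymptotic sketch, ignoring higher-order terms.

```python
import math

def bernoulli_fisher_info(p: float) -> float:
    """Fisher information of Bernoulli(p): I(p) = 1 / (p (1 - p))."""
    return 1.0 / (p * (1.0 - p))

def mle_sample_complexity(p: float, epsilon: float) -> int:
    """Samples for the MLE's standard error to reach epsilon:
    n ~ I(p)^{-1} / epsilon^2 = p (1 - p) / epsilon^2 (asymptotic)."""
    return math.ceil(1.0 / (bernoulli_fisher_info(p) * epsilon ** 2))

n = mle_sample_complexity(p=0.5, epsilon=0.01)  # hardest case: p = 1/2
```

The requirement is largest at $p = 1/2$, where the variance $p(1-p)$ peaks, matching the intuition that a fair coin is the hardest bias to pin down.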
5. Sample Complexity in Structured and High-Dimensional Settings
Dictionary Learning and Sparse Recovery
In Bayesian-optimal dictionary learning under the planted sparse coding model, perfect identification of an overcomplete dictionary $D$ requires only $O(N)$ samples, linear in the dimension $N$, with $\alpha$ the overcompleteness ratio and $\rho$ the sparsity rate; perfect recovery is possible whenever the sample ratio exceeds a threshold determined by $\alpha$ and $\rho$ (Sakata et al., 2013). A replica-method analysis reveals "success," "middle," and "failure" phases, and the linear-in-$N$ rate is both information-theoretically necessary and algorithmically achievable when the Bayes posterior dominates.
Population Recovery and Distinct Elements
In lossy population recovery (bit-erasure model), the sample complexity transitions from polynomial in $1/\delta$ with a fixed exponent when the erasure probability is below $1/2$ (parametric regime) to a polynomial whose exponent grows with the erasure probability above $1/2$ (nonparametric regime) (Polyanskiy et al., 2017). In noisy population recovery, the scaling becomes superpolynomial in $1/\delta$. Both the phase transition and the scaling are proved by duality and primal-dual linear programming techniques.
High-Dimensional Simplices under Noise
For learning a simplex with $K$ vertices in $\mathbb{R}^K$ from $n$ samples corrupted by additive Gaussian noise, the sample complexity is polynomial in $K$ and $1/\epsilon$ provided the signal-to-noise ratio (SNR) is large enough. When the SNR exceeds a polynomial threshold in the dimension, the rate matches the noiseless regime; otherwise, exponentially many samples are needed (Saberi et al., 2022).
Nonparametric Estimation with Structure
Dimension-independent rates are achievable for density estimation in multiview (low-rank) models or, more generally, under non-negative Lipschitz (NL) spectrum decompositions: the convergence rate depends polynomially only on the decay/growth of the NL spectrum, while worst-case minimax rates for classical histograms remain dominated by the "curse of dimensionality" unless latent structure is exploited (Vandermeulen, 2023).
6. Sample Complexity under Constraints: Privacy, Forecasting, and Data Compression
Differential Privacy
The sample complexity of private learning is characterized by the representation dimension $\mathrm{RepDim}(C)$, which can be much larger than the VC dimension in some regimes. For improper PAC learning under pure $\varepsilon$-differential privacy, the optimal sample complexity scales as $\Theta(\mathrm{RepDim}(C))$ for constant accuracy, confidence, and privacy parameters (Beimel et al., 2014). For simple classes (e.g., threshold or point functions), improper learners can avoid the exponential price in sample complexity that proper differentially-private learners incur.
Forecast Aggregation
In Bayesian forecast aggregation from $n$ experts' predictions, the sample complexity is in general exponential in $n$, unless the experts' signals are conditionally independent given the realized outcome, in which case the rate drops to a polynomial in $1/\varepsilon$ that is independent of $n$ (Lin et al., 2022).
Lossless Data Compression
The sample complexity of achieving a specified excess rate in lossless compression of a memoryless source is governed not by the Shannon entropy but by the Rényi entropy of order $1/2$, with the excess-rate dependence controlled by the order-$1/2$ Rényi divergence to the uniform distribution (Viaud et al., 10 Jan 2026). This coincides with the sample complexity of identity testing, establishing a deep link between data compression and hypothesis testing.
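The order-$1/2$ Rényi entropy that appears here is simple to compute from first principles: $H_{1/2}(P) = 2\log\sum_x \sqrt{P(x)}$. The sketch below compares it to the Shannon entropy on a skewed distribution (note $H_{1/2} \ge H_1$ always, with equality only for uniform distributions).

```python
import math

def renyi_entropy_half(p: list[float]) -> float:
    """Renyi entropy of order 1/2: H_{1/2}(P) = 2 * log(sum_x sqrt(P(x))),
    the alpha -> 1/2 case of H_alpha = log(sum p^alpha) / (1 - alpha)."""
    return 2.0 * math.log(sum(math.sqrt(px) for px in p))

def shannon_entropy(p: list[float]) -> float:
    """Shannon entropy in nats (the alpha -> 1 limit of Renyi entropy)."""
    return -sum(px * math.log(px) for px in p if px > 0)

dist = [0.7, 0.1, 0.1, 0.1]  # a skewed source, where H_{1/2} exceeds H_1
```

The gap $H_{1/2} - H_1$ is exactly what makes the sample complexity of compression strictly larger than naive entropy-based estimates would suggest.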
7. Sample Complexity in Quantum and Dynamical Systems
Quantum parameter and channel identification exhibits sample complexity scaling as $1/\epsilon^2$, with constants determined by the inverse Fisher information matrix, provided optimal quantum probe/measurement setups are used. Without entanglement or quantum memory, sample complexity may become exponential in system size, e.g., $2^{\Omega(n)}$ for $n$-qubit Pauli channel learning with separable probes (Kwon et al., 25 Feb 2026).
In dynamical systems identification, the PAC sample complexity for Jacobi matrix estimation, or for nonlinear mappings via Koopman decompositions, universally obeys minimax rates of the form $\Theta\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$, with problem-dependent constants set by Gramian, controllability, and dictionary conditioning (Jedra et al., 2019, Chen et al., 2018).
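The $1/\epsilon^2$ dependence behind such minimax rates (equivalently, mean-square error decaying as $1/n$) can be checked with a minimal Monte Carlo experiment. This is a generic illustration, with Gaussian mean estimation standing in for the identification problem.

```python
import random

def mean_squared_error(n: int, trials: int, rng: random.Random) -> float:
    """Average squared error of the sample mean of n N(0,1) draws;
    theory predicts E[(sample mean)^2] = 1/n, the parametric rate."""
    total = 0.0
    for _ in range(trials):
        x_bar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        total += x_bar ** 2
    return total / trials

rng = random.Random(0)
mse_small_n = mean_squared_error(n=100, trials=300, rng=rng)
mse_large_n = mean_squared_error(n=5_000, trials=300, rng=rng)
# 50x more samples -> roughly 50x smaller mean-square error (1/n scaling).
```

Inverting the relation, driving the error below a target $\epsilon$ requires $n \gtrsim 1/\epsilon^2$ samples, which is the empirical face of the minimax rate above.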
8. Recent Methodological Developments
Methodological innovations in sample complexity analysis include:
- Replica methods and phase diagrams (for sparse coding and statistical inference) reveal algorithmic and information-theoretical thresholds (Sakata et al., 2013).
- Duality and LP-based minimax lower bounds (in population recovery, estimation) unify linear estimator construction with complexity lower bounds (Polyanskiy et al., 2017).
- Error decomposition and optimization-statistics interaction (in deep discrete diffusion, conditional stochastic optimization) clarify how estimation error distributes among statistical, optimization, model-misspecification, and discretization components (Srikanth et al., 12 Oct 2025, Hu et al., 2019).
- Intrinsic covering number and effective dimension (in entropic OT, density estimation with structure) replace ambient dimension in controlling sample complexity (Wiesel, 7 Oct 2025, Vandermeulen, 2023).
These advances continue to unify sample complexity theory across supervised, unsupervised, and reinforcement learning, estimation, control, and quantum information, with deep connections to combinatorial geometry, optimization, and statistical physics.