Sample-Optimal Learning Algorithms

Updated 21 October 2025
  • Sample-optimal learning algorithms are methods that achieve the minimal number of data samples required to meet a target generalization error by matching foundational information-theoretic lower bounds.
  • They leverage structural properties in data through adaptive, tournament, and instance-optimal strategies, enabling efficient performance in diverse settings such as high-dimensional inference, reinforcement learning, and quantum state tomography.
  • These algorithms balance statistical optimality with computational efficiency and robustness, making them crucial for scalable applications in noisy, high-dimensional, and weakly supervised environments.

Sample-optimal learning algorithms are procedures that achieve the minimal possible number of data samples or trajectories required to guarantee a target generalization error, while often simultaneously emphasizing computational efficiency, robustness, and adaptability to problem structure. Such algorithms play a central role in modern statistical learning theory, high-dimensional inference, reinforcement learning, and computational statistics. Recent research has established tight characterizations for classical supervised learning, high-dimensional unsupervised estimation, structured probabilistic modeling, sequential decision making, boosting, quantum state tomography, and partial information problems such as learning from label proportions.

1. Fundamental Principles and Information-Theoretic Lower Bounds

The primary goal in sample-optimal learning is to match known information-theoretic lower bounds for excess risk, estimation error, or regret with explicit algorithms. Classical PAC (Probably Approximately Correct) learning establishes that for a concept class of VC-dimension $d$, the optimal sample complexity in the realizable setting is

$$m(\epsilon, \delta) = \Theta\!\left( \frac{d + \ln(1/\delta)}{\epsilon} \right)$$

where $\epsilon$ is the target error and $\delta$ the failure probability (Hanneke, 2015; Larsen, 2022).
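
As a back-of-the-envelope illustration, the bound translates into a concrete sample-size estimate as follows. The universal constant hidden in the $\Theta(\cdot)$ is unspecified by the theory, so the default `c = 1` below is purely an assumption:

```python
import math

def pac_sample_size(vc_dim: int, eps: float, delta: float, c: float = 1.0) -> int:
    """m(eps, delta) = Theta((d + ln(1/delta)) / eps) for realizable PAC
    learning; c stands in for the unspecified universal constant."""
    return math.ceil(c * (vc_dim + math.log(1.0 / delta)) / eps)

# Example: halfspaces in R^10 (VC-dimension 11), 1% error, 95% confidence.
print(pac_sample_size(vc_dim=11, eps=0.01, delta=0.05))  # -> 1400, up to the constant
```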

In density estimation and distribution learning, minimax lower bounds for total variation or $L_1$ loss scale as $\Omega(1/\epsilon^2)$ for most parametric families (e.g., mixtures of two Gaussians) (Daskalakis et al., 2013). For tree-structured graphical models, the optimal sample complexity scales as $O(n\ln n/\epsilon^2)$ for $n$ variables (Daskalakis et al., 2020), with matching lower bounds in both realizable (true tree) and agnostic settings (Gayen et al., 18 Nov 2024). In reinforcement learning, achieving $\epsilon$-optimal policies in Markov decision processes (MDPs) with function approximation or large state spaces requires sample complexity scaling as $O(1/\epsilon^2)$ in the best-known results for tabular RL and $O(K/(1-\gamma)^3\epsilon^2)$ when the transition model is linear in $K$ features (Yang et al., 2019).

Sample-optimality hinges on marrying algorithmic design with statistical limits: any further reduction in the number of samples would contravene minimax lower bounds or classical reductions such as Fano's inequality, Assouad's lemma, or Le Cam's two-point method.
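
For concreteness, the simplest of these reductions, Le Cam's two-point method, can be stated in one line (a standard textbook form, not tied to any single cited paper):

```latex
% Le Cam's two-point method: for any estimator \hat{\theta} built from
% n i.i.d. samples and any pair of parameters \theta_0, \theta_1,
\inf_{\hat{\theta}} \max_{i \in \{0,1\}}
    \mathbb{E}_{\theta_i}\!\left[ d(\hat{\theta}, \theta_i) \right]
    \;\ge\; \frac{d(\theta_0, \theta_1)}{2}
    \left( 1 - \mathrm{TV}\!\left( P_{\theta_0}^{\otimes n}, P_{\theta_1}^{\otimes n} \right) \right)
```

Taking $\theta_0, \theta_1$ at distance roughly $\epsilon$ with per-sample divergence on the order of $\epsilon^2$ keeps the total-variation term bounded away from 1 unless $n = \Omega(1/\epsilon^2)$, which is the shape of the lower bounds quoted above.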

2. Algorithmic Strategies and Structural Exploitation

Sample-optimal algorithms exploit structure in both the data and the hypothesis space, often using adaptive or instance-dependent procedures. The following strategies are predominant:

  • Tournament and Majority-Vote Schemes: For hypothesis selection among $N$ candidates, improved “tournament-style” algorithms achieve $O(\log N / \epsilon^2)$ sample complexity and $O(N\log N/\epsilon^2)$ time by recursively sub-sampling and utilizing robust pairwise estimators (Daskalakis et al., 2013); a minimal sketch of the underlying pairwise test appears after this list.
  • Instance-Optimal and Adaptive Learning: Instance-optimal approaches, such as the fingerprint-matching/Poissonization method for discrete distributions, decouple estimation (of histogram probabilities) from labeling, yielding error guarantees that adapt to each instance rather than worst-case scenarios (Valiant et al., 2015; Saha et al., 2019).
  • Piecewise Polynomial Estimation: For univariate densities, greedy and projection-oracle-based estimators achieve optimal $L_1$ error and nearly-linear runtime by adaptive merging and convex optimization using separation oracles tailored to the $\mathcal{A}_k$-norm (Acharya et al., 2015).
  • Compression and Local Entropy Methods: Rigorous risk bounds are derived via sample compression schemes (storing only critical examples, such as support vectors in SVMs) and local entropies (measuring function class complexity in local neighborhoods), leading to optimal or near-optimal guarantees for classification, SVMs, and online-to-batch conversions (Zhivotovskiy, 2017; Bousquet et al., 2020).
  • Sequential Weak-to-Strong Learners: Boosting algorithms that combine a handful of base learners via partition, majority vote, or bagging (e.g., the “Majority-of-5” AdaBoost variant) remove superfluous logarithmic factors and obtain sample-optimal weak-to-strong error bounds $O(d/(\gamma^2 m))$, where $\gamma$ is the weak learner margin (Høgsgaard et al., 30 Aug 2024; Larsen, 2022).
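
The following is a minimal sketch, in the spirit of the pairwise tests used by tournament selection: for each pair of discrete candidates, the empirical mass of the Scheffé set arbitrates which candidate's prediction is closer. This is the naive all-pairs version, not the sub-sampled variant of the cited work, and the data are synthetic:

```python
import numpy as np

def scheffe_tournament(samples, hypotheses):
    """All-pairs Scheffe tournament over discrete candidate distributions:
    for each pair, the empirical mass of the Scheffe set {x : p(x) > q(x)}
    arbitrates whose predicted mass is closer; the candidate with the most
    pairwise wins is returned."""
    counts = np.bincount(samples, minlength=len(hypotheses[0]))
    empirical = counts / len(samples)
    wins = np.zeros(len(hypotheses), dtype=int)
    for i, p in enumerate(hypotheses):
        for j in range(i + 1, len(hypotheses)):
            q = hypotheses[j]
            A = p > q                                  # Scheffe set of the pair
            emp = empirical[A].sum()
            if abs(p[A].sum() - emp) <= abs(q[A].sum() - emp):
                wins[i] += 1
            else:
                wins[j] += 1
    return hypotheses[int(np.argmax(wins))]

# Usage: draws come from a distribution near the second candidate (synthetic).
rng = np.random.default_rng(0)
cands = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.6, 0.3]), np.full(3, 1 / 3)]
draws = rng.choice(3, size=2000, p=[0.12, 0.58, 0.30])
print(scheffe_tournament(draws, cands))   # -> [0.1 0.6 0.3]
```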

In all cases, the algorithmic architecture is intricately linked to statistical structure: reliance on ERM is often mitigated through subsampling, boosting, or bagging, especially if one seeks to decouple computational efficiency from statistical sample size (Høgsgaard, 5 Feb 2025).

3. Sample-Optimality Beyond Supervised Learning

High-Dimensional and Structured Estimation

For high-dimensional Gaussian tree models, sample-optimal structure learning is achieved by coupling the Chow–Liu algorithm with novel conditional mutual information testers based on regression residuals. These testers achieve $O(1/\epsilon)$ sample complexity for independence detection, which is shown to be tight. Consequently, tree-structured Gaussian models can be learned with $\widetilde{O}(n/\epsilon)$ samples in the realizable case, whereas dropping the tree structural assumption necessitates quadratically more data, $O(n^2/\epsilon^2)$ (Gayen et al., 18 Nov 2024).
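
Below is a plain plug-in Chow–Liu sketch for Gaussian data, using $I(X_i; X_j) = -\tfrac{1}{2}\ln(1-\rho_{ij}^2)$ and a maximum-weight spanning tree; the cited sample-optimal algorithm replaces these plug-in estimates with conditional mutual information testers built from regression residuals, which this sketch omits:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_gaussian(X):
    """Chow-Liu tree from data matrix X (rows = samples, columns = variables):
    estimate pairwise mutual informations I(X_i; X_j) = -0.5*ln(1 - rho_ij^2)
    from sample correlations, then take a maximum-weight spanning tree."""
    corr = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(corr, 0.0)
    # Shrink rho^2 slightly so the log stays finite if |rho| is ~1.
    mi = -0.5 * np.log1p(-(corr ** 2) * (1.0 - 1e-12))
    # scipy gives a *minimum* spanning tree; negate weights for a maximum one.
    mst = minimum_spanning_tree(-mi)
    return sorted((int(min(i, j)), int(max(i, j))) for i, j in zip(*mst.nonzero()))

# Usage: a Markov chain X0 -> X1 -> X2 should be recovered as (0,1), (1,2).
rng = np.random.default_rng(1)
x0 = rng.normal(size=5000)
x1 = 0.8 * x0 + 0.6 * rng.normal(size=5000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=5000)
print(chow_liu_gaussian(np.column_stack([x0, x1, x2])))  # [(0, 1), (1, 2)]
```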

In Ising tree models, sample-optimality is obtained via a refined analysis of the Chow–Liu algorithm, employing strong $4$-consistency and Hellinger subadditivity, ensuring that $O(n\ln n/\epsilon^2)$ samples are sufficient even under arbitrary edge strengths (Daskalakis et al., 2020).

Reinforcement Learning with Function Approximation

In MDPs with large or continuous state spaces, sample-optimality with respect to the intrinsic problem dimension is achieved in multiple settings:

  • Linear MDPs: Parametric Q-learning with anchored feature representations attains a tight $O(K/(1-\gamma)^3\epsilon^2)$ sample complexity, with variance reduction and monotonicity preservation as key algorithmic components (Yang et al., 2019); a schematic of the sample-based Bellman backup these analyses build on follows this list.
  • Actor–Critic Algorithms: By integrating optimism for exploration, off-policy critic estimation, and rare-switching resets, actor–critic algorithms reach $O(d H^5 \log|\mathcal{A}|/\epsilon^2)$ trajectory complexity, matching tabular RL lower bounds in the presence of function approximation, as long as the Bellman eluder dimension $d$ is controlled (Tan et al., 6 May 2025).
  • Hybrid RL: Utilizing both offline and online samples allows non-optimistic actor–critics to achieve efficient learning when the number of offline samples exceeds the required threshold by a problem-dependent concentrability coefficient, resolving a major open problem in policy learning with hybrid data.
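
A bare-bones generative-model sketch of the sample-based Bellman backup underlying these analyses follows; the anchored feature representations, variance reduction, optimism, and hybrid-data machinery of the cited algorithms are all omitted, and the toy MDP is hypothetical:

```python
import numpy as np

def sampled_value_iteration(sample_next, reward, n_states, n_actions,
                            gamma=0.9, n_samples=200, n_iters=100):
    """Value iteration where each Bellman backup is a Monte Carlo average
    over n_samples generative-model draws per state-action pair."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                            # greedy value per state
        for s in range(n_states):
            for a in range(n_actions):
                nxt = sample_next(s, a, n_samples)   # sampled successor states
                Q[s, a] = reward(s, a) + gamma * V[nxt].mean()
    return Q

# Usage on a toy 2-state, 2-action MDP (hypothetical dynamics and rewards).
rng = np.random.default_rng(2)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s][a] = next-state distribution
              [[0.5, 0.5], [0.1, 0.9]]])
Q = sampled_value_iteration(lambda s, a, n: rng.choice(2, size=n, p=P[s][a]),
                            lambda s, a: float(s == 1), n_states=2, n_actions=2)
print(Q.round(2))
```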

Partial Information, Weak Supervision, and Quantum Learning

  • Learning from Label Proportions (LLP): The sample complexity under square loss is shown to be $O(k/\beta^2)$ in the most general (non-realizable) case and $O(k/\beta)$ in the realizable case, where $k$ is the bag size and $\beta$ the excess risk, matching lower bounds up to logarithmic terms (Busa-Fekete et al., 8 May 2025). This result is achieved through ERM/SGD with aggressive variance reduction and bag-level clipping techniques, markedly improving on previous quadratic or cubic dependencies on $k$; a schematic bag-level SGD loop appears after this list.
  • Quantum State Tomography: For learning $n$-qubit phase states, the optimal sample complexity is $\Theta(n^d)$ with separable measurements and $\Theta(n^{d-1})$ with entangled joint measurements, where $d$ is the degree of the defining Boolean polynomial and the bounds are tight (Arunachalam et al., 2022). The random partial derivative sampling and Pretty Good Measurement (PGM) strategies are key. Variants accommodate sparsity, low Fourier degree, and global depolarizing noise.
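
The sketch below is a schematic bag-level SGD loop for LLP under square loss with a linear model; the `clip` parameter only gestures at the bag-level clipping of the cited work, whose variance-reduction scheme is not reproduced here, and the planted-model usage data are synthetic:

```python
import numpy as np

def llp_sgd(bags, proportions, dim, lr=0.1, epochs=50, clip=1.0):
    """SGD on the bag-level square loss (mean_i <w, x_i> - proportion)^2
    for a linear model w; `clip` caps the per-bag gradient (schematic)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for X, p in zip(bags, proportions):      # X: (k, dim) bag of features
            xbar = X.mean(axis=0)                # bag-averaged features
            grad = 2.0 * (xbar @ w - p) * xbar   # gradient of the bag loss
            w -= lr * np.clip(grad, -clip, clip)
    return w

# Usage: bags of size k = 4 with a planted linear model (synthetic data).
rng = np.random.default_rng(5)
w_true = np.array([1.0, -2.0])
bags = [rng.normal(size=(4, 2)) for _ in range(300)]
props = [X.mean(axis=0) @ w_true for X in bags]
print(llp_sgd(bags, props, dim=2).round(2))      # approaches [ 1. -2.]
```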

4. Robustness, Adaptive Sample Complexity, and Instance-Optimality

Many sample-optimal algorithms exhibit robustness to model misspecification, heavy tails, or structured noise (malicious, nasty, or agnostic). The following features are central:

  • Instance-Adaptive Rates: In models such as Plackett–Luce, PAC-Wrapper eliminates arms based on empirical gap estimates, guaranteeing sample complexity scaling as $\sum_{i=2}^n \max\{1, 1/\Delta_i^2\}$, where $\Delta_i$ is the gap between the top-ranked and $i$th item, improved by a multiplicative $1/m$ factor when richer top-$m$ ranking feedback is available (Saha et al., 2019); a generic gap-adaptive elimination loop is sketched after this list.
  • Agnostic Boosting with Unlabeled Data: By exploiting potentials of the form $\phi(z, y) = \psi(z) - y z$, agnostic boosting algorithms can allocate gradient estimation across inexpensive unlabeled data (for $\psi'$) and sparse label queries (for $y$), sharply reducing labeled sample complexity to that of ERM (Ghai et al., 6 Mar 2025).
  • Learning with Strong Adversarial Noise: Label-efficient halfspace learning under malicious or nasty noise achieves near-optimal sample complexity $\tilde{O}(d)$ using phase-wise instance localization, variance control via matrix Chernoff inequalities, and iterative outlier suppression (Shen, 2021).
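
A generic gap-adaptive successive-elimination loop for best-arm identification illustrates how sample cost adapts to the gaps $\Delta_i$; PAC-Wrapper applies this elimination template to Plackett–Luce ranking feedback rather than the simple bandit `pull` oracle assumed here:

```python
import numpy as np

def successive_elimination(pull, n_arms, delta=0.05, max_rounds=5000):
    """Sample every surviving arm once per round; eliminate arms whose
    upper confidence bound drops below the best lower confidence bound.
    Arms with large gaps Delta_i exit early, giving instance-adaptive cost."""
    alive = list(range(n_arms))
    means = np.zeros(n_arms)
    for t in range(1, max_rounds + 1):
        for i in alive:
            means[i] += (pull(i) - means[i]) / t   # running mean after t pulls
        rad = np.sqrt(np.log(4 * n_arms * t * t / delta) / (2 * t))
        best_lcb = max(means[i] - rad for i in alive)
        alive = [i for i in alive if means[i] + rad >= best_lcb]
        if len(alive) == 1:
            return alive[0]
    return max(alive, key=lambda i: means[i])

# Usage: Bernoulli arms with means 0.5, 0.6, 0.8 (synthetic).
rng = np.random.default_rng(3)
print(successive_elimination(lambda i: rng.binomial(1, [0.5, 0.6, 0.8][i]), 3))  # 2
```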

The theme is a conscious algorithmic exploitation of structure, whether in the distributional instance, feedback structure, resource constraints (such as label scarcity), or noise model.

5. Computational Efficiency and Trade-offs

Sample-optimality is frequently accompanied by considerations of computational resources, given that naive implementations may suffer excessive overhead:

  • For PAC learning, recursive or bootstrapped majority-vote algorithms such as those of Hanneke and Breiman achieve the optimal rate, but the computational cost is tied to the expense of ERM on possibly large subsamples (Hanneke, 2015; Larsen, 2022).
  • Recent work (Høgsgaard, 5 Feb 2025) offers a trade-off: by invoking ERM on small subsamples of size $O(d)$ (where $d$ is the VC-dimension) rather than on the full data, and then aggregating via boosting or bagging, the total computation is reduced to nearly linear in the number of samples $m$, with only a logarithmic blowup in the number of weak learner calls; a schematic of this subsample-then-aggregate pattern follows this list. This insight is crucial for deploying sample-optimal algorithms at scale.
  • In density estimation, convex optimization via separation oracles and dynamic programming/greedy merging over finite partitions enables nearly-linear runtime while maintaining optimal statistical performance (Acharya et al., 2015; Diakonikolas et al., 2018).
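
The following schematic shows the subsample-then-aggregate pattern: a toy ERM for 1-D threshold classifiers (VC-dimension 1) is run on many small subsamples and aggregated by majority vote. Subsample size, voter count, and the toy learner are illustrative choices, not the parameters of the cited analysis:

```python
import numpy as np

def erm_threshold(X, y):
    """Toy ERM over 1-D threshold classifiers x -> s*sign(x - t) (VC-dim 1):
    exhaustively pick the (t, s) pair with the fewest training mistakes."""
    candidates = [(t, s) for t in np.unique(X) for s in (1, -1)]
    return min(candidates, key=lambda ts: np.mean(ts[1] * np.sign(X - ts[0]) != y))

def bagged_small_erm(X, y, n_voters=101, subsample=8, seed=0):
    """Run the toy ERM on many small subsamples and aggregate the resulting
    classifiers by majority vote (subsample-then-aggregate, schematically)."""
    rng = np.random.default_rng(seed)
    voters = []
    for _ in range(n_voters):
        idx = rng.choice(len(X), size=subsample, replace=True)
        voters.append(erm_threshold(X[idx], y[idx]))
    def predict(x):
        return np.sign(sum(s * np.sign(x - t) for t, s in voters))
    return predict

# Usage: labels from a planted threshold at 0.3 (synthetic data).
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=500)
y = np.sign(X - 0.3)
predict = bagged_small_erm(X, y)
print(predict(0.9), predict(-0.5))   # 1.0 -1.0
```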

The decoupling of statistical from computational efficiency is a major asset for contemporary applications, especially in large-data regimes.

6. Open Problems and Future Directions

Several persistent questions and new research directions remain:

  • Optimality for Proper Learning: Understanding when proper learners (those restricted to outputting hypotheses from a specific class) can achieve optimal sample complexity remains an active area, with the dual Helly number providing a key combinatorial characterization (Bousquet et al., 2020).
  • Optimality under Generalized Feedback: Extending the adaptive and sample-optimal instance-based framework to further partial-information settings, more general structured ranking, and weak supervision tasks.
  • Quantum Learning and Beyond: Determining quantum-classical separations in sample complexity for function classes with range larger than two, or under more general noisy measurement models.
  • Beyond $L_1$ and TV Losses: Generalizing instance- and sample-optimal learning to other metrics (e.g., KL-divergence, Wasserstein distance) and structured tasks (e.g., structured prediction, multitask learning).
  • Distributed and Online Efficiency: Leveraging the decoupling of computational and statistical costs for distributed, federated, or online implementations, where parallelism, communication, or adaptivity constraints dominate.
  • Reinforcement Learning with Broader Function Classes: Tightening sample-optimality and regret guarantees in domains with richer function approximation, deeper neural policies, or in continuous action spaces, and establishing minimal sufficient statistics for strategic exploration (Tan et al., 6 May 2025).

The ongoing refinement of robust, adaptive, and computationally scalable sample-optimal learning algorithms remains a key focus for both theoretical development and practical deployment in large-scale, noisy, and high-dimensional environments.
