Towards a statistical theory of data selection under weak supervision (2309.14563v2)

Published 25 Sep 2023 in stat.ML and cs.LG

Abstract: Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume to be given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, and to be given access to a `surrogate model' that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, to be denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and we use them to train a model via regularized empirical risk minimization. By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: $(i)$~Data selection can be very effective, in particular beating training on the full sample in some cases; $(ii)$~Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.

Authors (3)
  1. Germain Kolossov (1 paper)
  2. Andrea Montanari (165 papers)
  3. Pulkit Tandon (5 papers)
Citations (12)

Summary

Towards a Statistical Theory of Data Selection Under Weak Supervision

The paper studies data selection for statistical estimation and machine learning in settings where labeling or computational resources are limited relative to the size of the dataset. The authors propose a theoretical framework for data selection under weak supervision: a surrogate model, which predicts labels better than random guessing, is used to choose a smaller subset of samples, which are then labeled and used for training via regularized empirical risk minimization. The analysis covers both low-dimensional and high-dimensional asymptotic regimes, providing insights into unbiased and biased data selection methodologies.
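The select-then-label-then-train pipeline can be sketched end to end on synthetic data. The snippet below is a minimal illustration, not the paper's optimal scheme: the margin-based selection rule, the noise model for the surrogate, and all variable names are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's setting: N unlabeled points and a
# surrogate model that predicts labels better than chance.
N, d, n = 5000, 20, 500
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (X @ theta_true + rng.logistic(size=N) > 0).astype(int)

# Hypothetical surrogate: a noisy copy of the true coefficients.
theta_sur = theta_true + rng.normal(scale=1.0, size=d)
p_sur = 1.0 / (1.0 + np.exp(-(X @ theta_sur)))

# One simple biased selection rule: keep the n points the surrogate is
# least certain about (the paper derives optimal rules; this is a sketch).
G = np.argsort(np.abs(p_sur - 0.5))[:n]

# "Acquire" labels only for the selected subset, then fit a regularized
# model (scikit-learn's LogisticRegression applies an L2 penalty by default).
model = LogisticRegression(C=1.0).fit(X[G], y[G])
print(f"accuracy on all N points: {model.score(X, y):.3f}")
```

Only the $n$ selected points ever need labels, which is the source of the labeling savings the paper quantifies.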

Key Insights and Findings

  1. Unbiased vs. Biased Data Selection: The paper highlights the limitations of unbiased data selection, where each sample is weighted inversely to its selection probability to ensure an unbiased estimate. The authors show that biased data selection schemes, which involve different weighting strategies, can outperform unbiased ones significantly. Particularly in low-dimensional settings, optimal biased selection can reduce generalization error substantially compared to its unbiased counterpart.
  2. Dependence on Subsampling Fraction: Traditional data selection methods typically do not account for the subsampling fraction when computing selection probabilities. This paper demonstrates that optimal selection probabilities should depend substantially on the target subsampling fraction. In both theory and experimental results, ignoring this dependence can lead to suboptimal performance.
  3. Role of Surrogate Models: The surrogate model plays a crucial role in determining which samples are worth retaining. However, simply plugging the surrogate model’s predictions into the selection process may not be optimal, as indicated by the constructed scenarios where a less accurate surrogate model can sometimes lead to better data selection.
  4. Practical Implications of Data Selection: The research uncovers several scenarios where data selection not only retains performance comparable to full data training but can even surpass it. This is shown through both theoretical derivation and empirical evidence, indicating that careful selection can lead to models with reduced test error compared to those trained on the full dataset.
  5. Theoretical and Empirical Validation: The theoretical predictions are validated through rigorous quantitative analysis of various subsampling strategies, in both low- and high-dimensional regimes. Simulations, including logistic regression experiments on synthetic and real-world data, show excellent agreement between theory and practice.
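The distinction drawn in point 1 can be made concrete: the sketch below fits the same Poisson-subsampled data twice, once with inverse-probability weights (unbiased) and once unweighted (biased). It only demonstrates the mechanics; which variant wins depends on the regime, as the paper's analysis makes precise, and the uncertainty-style selection score here is a hypothetical choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic logistic data with an idealized surrogate (the true model).
N, d, n = 4000, 10, 400
theta = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (X @ theta + rng.logistic(size=N) > 0).astype(int)

# Inclusion probabilities proportional to an uncertainty-style score,
# targeting an expected subsample size of n (independent Poisson sampling).
p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
score = p_hat * (1.0 - p_hat)
pi = np.clip(n * score / score.sum(), 1e-6, 1.0)
keep = rng.random(N) < pi

# (a) Unbiased: reweight each kept point by 1 / pi_i so the weighted
#     empirical risk is an unbiased estimate of the full-sample risk.
m_unbiased = LogisticRegression(max_iter=1000).fit(
    X[keep], y[keep], sample_weight=1.0 / pi[keep])
# (b) Biased: same subsample, no reweighting.
m_biased = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])

print(f"unbiased-reweighted accuracy: {m_unbiased.score(X, y):.3f}")
print(f"unweighted (biased) accuracy: {m_biased.score(X, y):.3f}")
```

The reweighting in (a) restores unbiasedness at the cost of inflating the variance of the weighted risk, which is precisely the trade-off the paper shows can make the biased scheme (b) preferable.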

Theoretical Framework and Models

  • Low-Dimensional Regime: In this setting, the number of dimensions is held fixed as the sample size grows, so classical statistical assumptions apply. The authors use asymptotic analysis to derive optimal subsampling schemes and to quantify the impact of these selections on the generalization error.
  • High-Dimensional Regime: As the number of samples and the number of dimensions grow together, the paper shifts its focus to generalized linear models, leveraging techniques such as Gaussian comparison inequalities to characterize the behavior of subsampling strategies in this proportional regime. The choice of selection probabilities and the accuracy of the surrogate model are explored in depth.
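To fix notation, the regularized ERM step analyzed in both regimes can be written as follows. This is a sketch consistent with the abstract, with $\ell$ a loss function, $\lambda$ the ridge penalty, and $\pi_i$ the probability that sample $i$ is selected into $G$:

```latex
\hat{\boldsymbol\theta} \;=\; \arg\min_{\boldsymbol\theta}\;
\sum_{i\in G} w_i\,\ell\bigl(y_i,\langle{\boldsymbol\theta},{\boldsymbol x}_i\rangle\bigr)
\;+\;\lambda\,\|{\boldsymbol\theta}\|_2^2,
\qquad
w_i \;=\;
\begin{cases}
1/\pi_i & \text{(unbiased reweighted subsampling)},\\
1 & \text{(unweighted, i.e.\ biased, subsampling)}.
\end{cases}
```

Inverse-probability weights make the subsampled risk an unbiased estimate of the full-sample risk; the paper's central finding is that insisting on this unbiasedness can be substantially suboptimal.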

Implications and Future Directions

The implications of this research extend both theoretically and practically:

  • Improved Algorithms: The insights from this paper can guide the development of more efficient learning algorithms that maintain high performance when trained on significantly reduced datasets, offering savings in computational cost and time.
  • Active Learning and Experimental Design: The strategies outlined can be applied to more complex settings within active learning and experimental design, where data selection is critical for achieving specific performance goals under resource constraints.
  • Broader Applications in AI: The results suggest potential applications in AI fields requiring efficient data utilization and quick model adaptation, including domains like autonomous driving where rapid decision-making with partial information is essential.

Future research may look into refining these methodologies, exploring how they interact with different model architectures and data distributions, and expanding on the integration of imperfect surrogate models through a deeper minimax analysis approach.