Towards a statistical theory of data selection under weak supervision (2309.14563v2)
Abstract: Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning. We assume we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$ and access to a `surrogate model' that can predict labels $y_i$ better than random guessing. Our goal is to select a subset of the samples, denoted by $\{{\boldsymbol x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and use them to train a model via regularized empirical risk minimization. Using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: $(i)$ data selection can be very effective, in particular beating training on the full sample in some cases; $(ii)$ certain popular choices in data selection methods (e.g., unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.
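The abstract describes a pipeline: score unlabeled points with a surrogate model, select $n$ of them, acquire labels only for those, and fit a model by regularized empirical risk minimization. Below is a minimal sketch of such a pipeline, assuming a binary classification setting and an unweighted, margin-based selection rule; the `select_and_train` helper, the choice of selection rule, and the logistic-regression learner are illustrative assumptions, not the paper's prescribed scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_and_train(X, surrogate_proba, n, label_oracle, reg_C=1.0):
    """Illustrative data-selection pipeline (hypothetical helper).

    X               : (N, d) array of unlabeled features.
    surrogate_proba : (N,) surrogate-model probabilities P(y=1 | x).
    n               : labeling budget, n < N.
    label_oracle    : callable mapping selected indices to their true labels.
    """
    # Score each point by the surrogate's uncertainty (distance of the
    # predicted probability from 1/2); this is one illustrative rule,
    # not the selection scheme analyzed in the paper.
    margin = np.abs(surrogate_proba - 0.5)
    G = np.argsort(margin)[:n]            # keep the n most uncertain points
    y_G = label_oracle(G)                 # acquire labels only for the subset G
    # Unweighted regularized ERM (L2-penalized logistic regression) on G.
    model = LogisticRegression(C=reg_C).fit(X[G], y_G)
    return model, G

# Usage on synthetic data: the surrogate is a noisy version of the true logits.
N, d, n = 10_000, 20, 500
rng = np.random.default_rng(0)
w = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w))).astype(int)
surrogate = 1 / (1 + np.exp(-(X @ w + rng.normal(scale=2.0, size=N))))
model, G = select_and_train(X, surrogate, n, label_oracle=lambda idx: y[idx])
```

Note the deliberate absence of importance reweighting of the selected points: the abstract's point $(ii)$ is that unbiased reweighted subsampling (and influence-function-based subsampling) can be substantially suboptimal, so the sketch trains on the selected subset unweighted.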
- Germain Kolossov
- Andrea Montanari
- Pulkit Tandon