Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond (2402.17327v1)

Published 27 Feb 2024 in cs.LG and cs.DS

Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable.

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

The paper "Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond" addresses the challenge of training machine learning models with large datasets efficiently by introducing a novel data selection method. This approach leverages kk-means clustering and sensitivity sampling to enable the selection of smaller representative subsets of data, facilitating the training of machine learning models in a resource-efficient manner.

Fundamental Propositions

The authors study data selection, where the goal is to determine a succinct subset of the data that faithfully represents the entire dataset during model training. The proposed methodology combines k-means clustering with sensitivity sampling and comes with theoretical guarantees: the average loss over the selected subset matches the average loss over the whole dataset up to a multiplicative $(1\pm\varepsilon)$ factor and an additive term governed by the k-means cost of the embeddings.
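
Restating the abstract's guarantee in symbols (a paraphrase, with $\hat{L}_S$ denoting the importance-weighted average loss over the sampled set $S$ and $\bar{L}$ the average loss over the full dataset):

$$(1-\varepsilon)\,\bar{L} - \varepsilon\lambda\Phi_k \;\le\; \hat{L}_S \;\le\; (1+\varepsilon)\,\bar{L} + \varepsilon\lambda\Phi_k, \qquad |S| \approx k + \tfrac{1}{\varepsilon^2},$$

where $\Phi_k$ is the $k$-means cost of the input embeddings and $\lambda$ is the Hölder constant.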

The authors assert that their method, when applied to fine-tuning foundation models, surpasses contemporary strategies in both performance and scalability. They also show that the approach extends to linear regression, where it rivals leverage score sampling while being conceptually simpler and more scalable.

Theoretical Analysis and Algorithmic Approach

The algorithmic framework rests on Hölder continuity of the model loss with respect to the embedding representation, a weaker assumption than the Lipschitz continuity used in prior work. This relaxed assumption makes the analysis applicable across a range of settings, including foundation models, and supports the generality of the data-selection methodology.

Specifically, the authors contribute the following advancements:

  1. A systematic approach that selects data points more robustly, mitigating issues associated with outliers.
  2. Demonstration of strong theoretical results with minimal assumptions on dataset embeddings, allowing broader applicability across tasks.
  3. Empirical evidence showcasing the method's efficacy in comparison to traditional methods on benchmark datasets such as MNIST and for fine-tuning LLMs.
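
The selection procedure can be illustrated with a short sketch. The Python snippet below is an illustration of the general clustering-plus-sensitivity-sampling recipe rather than the authors' implementation: the function name `sensitivity_sample` and the exact scoring rule (a point's share of the k-means cost plus a uniform within-cluster term) are assumptions in the spirit of standard sensitivity sampling. Intuitively, the Hölder assumption limits how much the loss can vary within a cluster, which is why the clustering cost shows up in the additive error term.

```python
# Illustrative sketch of clustering-based sensitivity sampling (not the
# authors' reference implementation): cluster the embeddings with k-means,
# score each point, then sample proportionally and return importance weights.
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, k, sample_size, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    # Each point's share of the k-means cost: squared distance to its center.
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1) ** 2
    cluster_sizes = np.bincount(km.labels_, minlength=k)
    # Sensitivity-style score: cost share plus a uniform term within the cluster.
    scores = dists / max(dists.sum(), 1e-12) + 1.0 / (k * cluster_sizes[km.labels_])
    probs = scores / scores.sum()
    # Sample with replacement; with these weights, sum(weights * loss[idx]) is
    # an unbiased estimate of the total loss over the full dataset.
    idx = rng.choice(len(embeddings), size=sample_size, replace=True, p=probs)
    weights = 1.0 / (sample_size * probs[idx])
    return idx, weights
```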

Numerical Results and Practical Implications

The empirical results underscore the practicality of the method in scenarios where training efficiency matters. For instance, fine-tuning a T5-Small model for machine translation on the selected subsets yields noticeable accuracy improvements over baselines such as random sampling and state-of-the-art data selection approaches.

Applied to linear regression, the methodology delivers results competitive with advanced techniques such as leverage score sampling. This signals a significant contribution to the active learning domain, suggesting that clustering-based sensitivity sampling could become a standard procedure for data-efficient learning in varied contexts.
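
For intuition on the regression comparison, the sketch below contrasts the two row-sampling strategies for least squares. It is illustrative only: the helper names are hypothetical and the clustering-based probabilities are an assumed proxy in the spirit of the paper rather than its exact rule, while the leverage scores follow their standard definition as the diagonal of the hat matrix.

```python
# Illustrative comparison of row-sampling strategies for least squares.
# Neither function reproduces the paper's code; names are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

def leverage_score_probs(X):
    # Standard leverage scores: diagonal of the hat matrix X (X^T X)^+ X^T,
    # computed via a thin QR factorization (assumes X has full column rank).
    Q, _ = np.linalg.qr(X)
    scores = (Q ** 2).sum(axis=1)
    return scores / scores.sum()

def clustering_probs(X, k=32, seed=0):
    # Assumed clustering-based proxy: a row's share of the k-means cost
    # plus a uniform within-cluster term.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1) ** 2
    sizes = np.bincount(km.labels_, minlength=k)
    scores = dists / max(dists.sum(), 1e-12) + 1.0 / (k * sizes[km.labels_])
    return scores / scores.sum()

def sampled_least_squares(X, y, probs, m, seed=0):
    # Sample m rows with the given probabilities, rescale by 1/sqrt(m * p)
    # (the usual importance weighting), and solve the reduced problem.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    w = 1.0 / np.sqrt(m * probs[idx])
    beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return beta
```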

Future Directions

Given the promising results, further investigation could be directed towards extending the approach to a broader range of machine learning settings, exploring hyperparameter choices in the clustering step, or evaluating the method in unsupervised and semi-supervised learning frameworks. Additionally, integrating this data selection methodology with larger or more sophisticated models could clarify its scalability limits and uncover further efficiency gains.

In summary, the paper effectively bridges theoretical underpinnings and practical implications, empowering data-efficient model training through clustering-based approaches. The proposed method offers a valuable trajectory towards optimized resource usage without compromising the performance of machine learning models, marking a significant step in the quest for practical and efficient AI solutions.

Authors (8)
  1. Kyriakos Axiotis
  2. Vincent Cohen-Addad
  3. Monika Henzinger
  4. Sammy Jerome
  5. Vahab Mirrokni
  6. David Saulpic
  7. David Woodruff
  8. Michael Wunder