
GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning (2012.10630v4)

Published 19 Dec 2020 in cs.LG and cs.AI

Abstract: Large scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several papers try to make the training process more efficient and robust. However, most existing work either focuses on robustness or efficiency, but not both. In this work, we introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework. We formulate Glister as a mixed discrete-continuous bi-level optimization problem to select a subset of the training data, which maximizes the log-likelihood on a held-out validation set. Next, we propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates and can be applied to any loss-based learning algorithm. We then show that for a rich class of loss functions including cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete data selection is an instance of (weakly) submodular optimization, and we analyze conditions for which Glister-Online reduces the validation loss and converges. Finally, we propose Glister-Active, an extension to batch active learning, and we empirically demonstrate the performance of Glister on a wide range of tasks including, (a) data selection to reduce training time, (b) robust learning under label noise and imbalance settings, and (c) batch-active learning with several deep and shallow models. We show that our framework improves upon state of the art both in efficiency and accuracy (in cases (a) and (c)) and is more efficient compared to other state-of-the-art robust learning algorithms in case (b).

Generalization Based Data Subset Selection for Efficient and Robust Learning

The paper, "GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning," addresses the challenge of selecting training-data subsets that keep models accurate while sharply reducing training cost. The work targets efficiency and robustness together: it aims to mitigate the heavy computational demands of large-scale models while coping with real-world data issues such as label noise and class imbalance.

Key Contributions

The authors introduce a novel framework, GLISTER (GeneraLIzation based data Subset selecTion for Efficient and Robust learning), which formulates the subset selection problem as a mixed discrete-continuous bi-level optimization problem. This approach selects a subset from the training data that maximizes the validation set's log-likelihood, thereby encouraging generalization while maintaining robustness to data irregularities. The framework is empirically validated across various tasks including efficiency improvements, robustness under noisy label environments, and active learning.
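Concretely, the bi-level formulation can be written as follows. The symbols below (training set U, held-out validation set V, subset-size budget k, and log-likelihoods LL_T and LL_V) follow the abstract's description and are illustrative rather than the paper's exact notation.

```latex
% Bi-level subset selection (notation illustrative):
% outer problem  - choose a subset S of the training set U with |S| <= k;
% inner problem  - fit the model parameters \theta on S;
% objective      - log-likelihood on the held-out validation set V.
\[
  S^{*} \;=\;
  \underset{S \subseteq \mathcal{U},\; |S| \le k}{\operatorname{argmax}}
  \; LL_{V}\!\Big(
      \underset{\theta}{\operatorname{argmax}}\; LL_{T}(\theta, S),\;
      \mathcal{V}
  \Big)
\]
```

The outer problem is discrete (which examples to keep) and the inner problem is continuous (model training), which is what makes the optimization mixed discrete-continuous; GLISTER-Online interleaves approximate solutions to both rather than solving them to completion.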

Special Cases and Theoretical Insights

The paper analyzes GLISTER in special cases with classical models such as Naive Bayes, k-nearest neighbors, and linear regression, establishing connections to submodular optimization, a property often exploited for efficient data selection because of its diminishing-returns structure. For a rich class of loss functions, including cross-entropy, hinge loss, squared loss, and logistic loss, the inner discrete data-selection problem is an instance of (weakly) submodular maximization, so efficient approximations can be constructed via greedy or stochastic greedy algorithms, as sketched below.
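As one concrete illustration, here is a minimal sketch of stochastic greedy selection, assuming a gain(element, selected) callable that estimates the improvement from adding an element; the function and parameter names are illustrative assumptions, not the authors' implementation.

```python
import random

def stochastic_greedy(candidates, k, gain, sample_frac=0.1, seed=0):
    """Stochastic greedy maximization of a (weakly) submodular set function.

    candidates : list of training-example indices
    k          : target subset size
    gain       : callable(element, selected_list) -> estimated marginal gain
                 (e.g. an approximated validation-likelihood improvement)
    """
    rng = random.Random(seed)
    selected = []
    remaining = set(candidates)
    # Each round, score only a random sample of the remaining pool
    # instead of the full pool, which is what makes this "stochastic" greedy.
    sample_size = max(1, int(sample_frac * len(candidates)))
    for _ in range(k):
        if not remaining:
            break
        pool = rng.sample(sorted(remaining), min(sample_size, len(remaining)))
        best = max(pool, key=lambda e: gain(e, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a GLISTER-Online-style loop, the gain callable would correspond to the Taylor-approximated change in validation log-likelihood discussed below, and selection would be rerun periodically as the model parameters change.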

Empirical Evaluation

The empirical validation of GLISTER demonstrates substantial improvements in computational efficiency, achieving up to 6x speedups over full-training runs while maintaining comparable accuracy. The robustness is evidenced by enhanced model performance in scenarios plagued by label noise and class imbalances. Interestingly, GLISTER's data selection properties enable it, in some cases, to outperform full-model training on noisy datasets, highlighting its efficacy in robust learning environments.

Implementation and Practical Implications

By exposing parameters such as the subset size (k), the frequency of data-selection rounds (L), and a regularization coefficient (λ), the framework provides flexibility across diverse application scenarios. The iterative algorithm, GLISTER-Online, uses a Taylor-series approximation of the validation loss to avoid retraining the model for every candidate element, keeping GLISTER scalable and practical for large datasets and deep models despite the cost inherent in submodular function optimization, as sketched below.
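A hedged sketch of that approximation, with illustrative symbols and assuming a single hypothetical gradient step per candidate: for current parameters θ^t, candidate element e, and learning rate η, the validation log-likelihood after the hypothetical update is expanded to first order, so each candidate can be scored with gradient inner products instead of retraining.

```latex
\[
  LL_{V}\big(\theta^{t} + \eta\, \nabla_{\theta} LL_{T}(\theta^{t}, e)\big)
  \;\approx\;
  LL_{V}(\theta^{t})
  \;+\;
  \eta\, \nabla_{\theta} LL_{T}(\theta^{t}, e)^{\top}\,
          \nabla_{\theta} LL_{V}(\theta^{t}).
\]
```

Because the first term is constant across candidates, ranking elements only requires the dot product between each candidate's training gradient and the validation gradient, which is cheap to recompute every L rounds.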

Future Directions

Looking ahead, the paper proposes GLISTER-Active, an extension to batch active learning in which unlabeled points are scored using hypothesized labels taken from the current model's predictions. This variant shows promising results against established batch active learning algorithms, suggesting broader applicability in domains where labeled data is scarce or costly; extending GLISTER to scenarios involving distribution shift is highlighted as a natural direction for further exploration.
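A minimal sketch of one such selection round follows, assuming a generic model with a predict method and a pluggable select_fn subset-selection routine; both names and the interface are illustrative assumptions, not the authors' code.

```python
def glister_active_round(model, unlabeled_X, val_X, val_y, budget, select_fn):
    """One batch active-learning round in the spirit of GLISTER-Active.

    model       : current trained model exposing predict()
    unlabeled_X : pool of unlabeled examples
    val_X, val_y: held-out validation data
    budget      : number of points to query this round
    select_fn   : subset-selection routine (e.g. stochastic greedy over an
                  approximated validation-likelihood gain) returning indices
                  into unlabeled_X
    """
    # Hypothesize labels for the unlabeled pool from the current model.
    pseudo_y = model.predict(unlabeled_X)
    # Select the batch that (approximately) maximizes validation log-likelihood
    # when pool points are treated as labeled with their hypothesized labels.
    chosen = select_fn(unlabeled_X, pseudo_y, val_X, val_y, budget)
    return chosen  # indices to send for human labeling
```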

In summary, this paper presents a comprehensive and experimentally validated framework that bridges efficiency and robustness, offering practical tools for improving large-scale model training on diverse real-world datasets. GLISTER sets a practical precedent for future research on combining efficient data utilization with robust model development.

Authors (4)
  1. Krishnateja Killamsetty (17 papers)
  2. Durga Sivasubramanian (8 papers)
  3. Ganesh Ramakrishnan (88 papers)
  4. Rishabh Iyer (70 papers)
Citations (176)