Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt (2206.07137v3)

Published 14 Jun 2022 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.

Authors (11)
  1. Sören Mindermann (20 papers)
  2. Jan Brauner (9 papers)
  3. Muhammed Razzak (6 papers)
  4. Mrinank Sharma (17 papers)
  5. Andreas Kirsch (30 papers)
  6. Winnie Xu (12 papers)
  7. Benedikt Höltgen (7 papers)
  8. Aidan N. Gomez (16 papers)
  9. Adrien Morisot (8 papers)
  10. Sebastian Farquhar (31 papers)
  11. Yarin Gal (170 papers)
Citations (123)

Summary

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

The paper "Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt" introduces Reducible Holdout Loss Selection (RHO-LOSS), a data selection technique for making large-scale neural network training more efficient. It targets the waste inherent in uniform data sampling, where much of the computation is spent on redundant or noisy points, inflating training time.

Core Proposal: RHO-LOSS

The principal contribution is the RHO-LOSS algorithm, which selects training points by their estimated potential to reduce the model's generalization loss. This contrasts with existing methods that either emphasize "hard" examples (often noisy or less task-relevant) or "easy" examples (redundant once learned). RHO-LOSS instead focuses on points that are learnable, worth learning, and not yet learnt, yielding a more efficient training process.

Methodological Insights

RHO-LOSS employs a theoretically grounded selection function derived from probabilistic modeling. For each candidate point, it estimates the reducible holdout loss: the current training loss minus an estimated irreducible holdout loss (the loss that a model trained only on holdout data assigns to the point). Points with the largest reducible holdout loss are those expected to most reduce loss on held-out data, so prioritizing them cuts the number of training iterations needed.
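
In notation close to the paper's, with D_t the data seen so far, D_ho the holdout set, and B_t the current candidate batch, the acquisition rule can be sketched (in simplified form) as:

\[
(x^*, y^*) \;=\; \operatorname*{arg\,max}_{(x, y) \in B_t} \; L[y \mid x;\, \mathcal{D}_t] \;-\; L[y \mid x;\, \mathcal{D}_{ho}]
\]

where the first term is the loss the current model assigns to the point and the second is the irreducible holdout loss, i.e. the loss of a model trained only on the holdout set.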

To make the method computationally feasible for large-scale deep learning, the paper proposes several approximations: replacing Bayesian inference with standard gradient-descent training, approximating the holdout model, and using smaller irreducible-loss models to cut compute costs.
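
A minimal PyTorch-style sketch of one resulting training step is given below. It assumes a frozen, pre-trained irreducible-loss model il_model, the current model model, and large candidate batches (x_big, y_big); the function name, batch sizes, and use of cross-entropy are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def rho_loss_step(model, il_model, optimizer, x_big, y_big, n_select=32):
    """Select the n_select points with the highest reducible holdout loss
    from a larger candidate batch, then take one gradient step on them."""
    model.train()
    with torch.no_grad():
        # Per-example training loss under the current model.
        train_loss = F.cross_entropy(model(x_big), y_big, reduction="none")
        # Per-example irreducible holdout loss from the (smaller) IL model,
        # which is assumed to be pre-trained on holdout data and kept frozen.
        irreducible_loss = F.cross_entropy(il_model(x_big), y_big, reduction="none")

    # Reducible holdout loss: high training loss that the IL model shows is
    # actually achievable, i.e. the point is learnable but not yet learnt.
    reducible_loss = train_loss - irreducible_loss
    top_idx = torch.topk(reducible_loss, k=n_select).indices

    # Standard gradient step on the selected points only.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_big[top_idx]), y_big[top_idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the irreducible-loss model never changes during training, its per-point losses can in principle be computed once up front and looked up thereafter, which keeps the selection overhead small.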

Empirical Evidence and Results

Through extensive experiments across diverse task domains, including vision (e.g., CIFAR-10, Clothing-1M) and natural language processing (e.g., CoLA, SST-2), RHO-LOSS demonstrated substantial reductions in training steps compared to uniform sampling and other data selection methods. For instance, on the noisy, web-scraped Clothing-1M dataset, RHO-LOSS trained in 18x fewer steps while also improving final model accuracy by 2%.

Implications and Future Directions

The introduction of RHO-LOSS has significant implications for both the theoretical understanding and practical application of machine learning models. By providing a framework to systematically prioritize data by its contribution to generalization, this work underscores the value of intelligent data selection in expediting training and improving model performance.

Future research directions may involve further refinement of irreducible loss estimations and exploration of RHO-LOSS in conjunction with other model optimization techniques. Additionally, integrating the selection algorithm into more sophisticated parallel computation systems could further enhance its utility, potentially influencing training paradigms at greater scales and across wider application areas.

The paper provides a foundational step towards more cost-effective and time-efficient learning, promoting a paradigm shift in tackling the challenges posed by vast, noisy datasets typical in today's AI applications.
