- The paper introduces an online loss-ranked batch selection method that dynamically prioritizes high-loss datapoints to accelerate convergence.
- It integrates the approach with AdaDelta and Adam, achieving up to a fivefold speedup in training on MNIST without sacrificing accuracy.
- Decaying the selection pressure exponentially over the course of training ensures stability and mitigates the risk of overfitting in extended training regimes.
Online Batch Selection for Faster Training of Neural Networks: An Overview
In the paper titled "Online Batch Selection for Faster Training of Neural Networks," Loshchilov and Hutter present a novel methodology aimed at improving the efficiency of training deep neural networks (DNNs) by dynamically selecting training datapoints. The proposed approach targets the computational bottlenecks encountered during stochastic gradient descent (SGD) optimization, specifically focusing on the selection of minibatches.
Summary of Methodology
The paper investigates online batch selection strategies in the context of two widely used optimization algorithms: AdaDelta and Adam. At the core of the proposed method is loss-ranked batch selection. Instead of iterating over shuffled datapoints in the conventional manner, this method sets each datapoint's probability of being included in a minibatch according to its rank when the dataset is sorted by the latest computed loss. Specifically, the probability decays exponentially with rank, introducing a controlled selection pressure that samples high-loss datapoints more frequently.
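The rank-based scheme can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' reference implementation; the function name and the example values are assumptions.

```python
import numpy as np

def rank_based_probs(losses, pressure):
    """Selection probability for each datapoint, decaying exponentially
    with its rank when datapoints are sorted by descending loss.

    `pressure` (>= 1) controls the selection pressure: roughly the ratio
    between the selection probability of the highest-loss and the
    lowest-loss datapoint. pressure == 1 recovers uniform sampling.
    """
    n = len(losses)
    order = np.argsort(losses)[::-1]              # rank 0 = highest loss
    # Unnormalized probability for rank r: exp(-log(pressure) * r / n)
    p = np.exp(-np.log(pressure) * np.arange(n) / n)
    p /= p.sum()
    probs = np.empty(n)
    probs[order] = p                              # map rank probs back to indices
    return probs

# Draw a minibatch of indices according to the rank-based distribution.
losses = np.array([0.1, 2.3, 0.5, 1.7])
probs = rank_based_probs(losses, pressure=100.0)
batch = np.random.choice(len(losses), size=2, p=probs, replace=False)
```

In practice the losses need only be approximately current: the paper keeps the latest known loss per datapoint and periodically re-sorts, so the ranking is cheap to maintain alongside training.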
This selection mechanism is strategically designed to accelerate convergence by concentrating computational resources on datapoints that contribute significantly to the objective function, thereby reducing the iterations needed to achieve optimal parameter adjustments.
Numerical Results and Implications
Experimental results on the MNIST dataset indicate substantial speedups: with loss-ranked batch selection, both AdaDelta and Adam reached a given training loss roughly five times faster. The results suggest that the approach exploits datapoint importance effectively, accelerating training without compromising accuracy or robustness.
The findings also show that an overly aggressive selection pressure can lead to overfitting or unstable convergence in long training runs. The authors therefore decay the selection pressure over the course of training, which improves stability.
Theoretical and Practical Implications
From a theoretical perspective, this research contributes to the understanding of non-uniform sampling in stochastic approximation, providing a parameterizable framework for identifying the datapoints most critical to training. Because selection depends on rank rather than on raw loss values, the induced notion of datapoint importance is robust to the scale of the loss, can evolve during training, and mitigates noise while directing computation where it matters most.
Practically, the implications for accelerating DNN training are significant, particularly in resource-constrained environments or where rapid prototyping and short iteration cycles are vital. Faster training makes it cheaper to iterate on models over large datasets and widens access to computationally demanding methods.
Future Directions
The results invite further exploration into adaptive batch selection methodologies, suggesting several potential avenues:
- Adaptive Strategies: Development of adaptive selection pressures that adjust dynamically based on convergence metrics and performance indicators, possibly integrating recent advances in adaptive learning rate methodologies.
- Extended Validation: Application of the batch selection framework across diverse datasets beyond MNIST to evaluate generalizability and robustness in more complex problem domains.
- Integration with Other Techniques: Exploring synergies between this method and other optimization accelerations, such as curriculum learning or learning rate adaptations, to compound efficiency gains.
By fostering a deeper understanding of the dynamics of datapoint selection during optimization, this research lays a foundation for future work on refining modern DNN training protocols, with implications for both computational efficiency and the design of training pipelines.