- The paper introduces a novel T1-T2 technique that concurrently optimizes model parameters and hyperparameters using stochastic gradient descent.
- It achieves computational efficiency by approximating hypergradients in real time, bypassing costly inverse Hessian computations.
- The method robustly tunes hyperparameters across architectures and datasets, preventing validation overfitting and setting a baseline for adaptive training.
Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters
The paper "Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters" by Luketina et al. presents a novel methodology for hyperparameter optimization in neural networks. Hyperparameter selection traditionally involves multiple full training runs, with validation-set performance used to pick the best values. This process is computationally intensive, and its cost grows exponentially with the number of hyperparameters. The authors propose an alternative based on gradient-based optimization, which tunes hyperparameters in real time during model training.
Methodology Overview
The innovative aspect of the proposed method, termed T1−T2, is the concurrent optimization of both model parameters and hyperparameters using stochastic gradient descent. Here, T1 refers to the training set and T2 to the validation set, which is used exclusively for hyperparameter updates. The method extends the optimization process to hyperparameters by treating them analogously to model parameters: ordinary parameter gradients adjust the model weights, while hypergradient updates aim to decrease the validation cost, so that a good set of hyperparameters is found within a single training run.
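The alternating scheme described above can be sketched on a toy problem. The example below is an illustration under assumed notation, not the paper's implementation: it tunes a single L2 strength for a one-dimensional ridge regression, with the gradients written out by hand. Each iteration takes a T1 step (SGD on the regularized training loss) followed by a T2 step (a hypergradient update obtained by differentiating the validation loss through that single parameter update).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression: T1 = training split, T2 = validation split.
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y1 = 2.0 * x1 + rng.normal(scale=0.5, size=50)
y2 = 2.0 * x2 + rng.normal(scale=0.5, size=50)

def d_train_loss(w, lam):
    # d/dw [ mean((w*x - y)^2) + lam * w^2 ] on the training split
    return np.mean(2 * (w * x1 - y1) * x1) + 2 * lam * w

def d_val_loss(w):
    # d/dw mean((w*x - y)^2) on the validation split (no penalty term)
    return np.mean(2 * (w * x2 - y2) * x2)

w, lam = 0.0, 1.0         # initial weight and L2 strength (illustrative)
alpha, beta = 0.05, 0.05  # learning rates for parameters / hyperparameter

for _ in range(200):
    w_prev = w
    # T1 step: ordinary SGD on the regularized training loss.
    w = w - alpha * d_train_loss(w, lam)
    # T2 step: greedy hypergradient -- only the dependence of this single
    # update on lam is kept; here d(w)/d(lam) = -alpha * 2 * w_prev.
    hypergrad = d_val_loss(w) * (-alpha * 2 * w_prev)
    lam = max(0.0, lam - beta * hypergrad)
```

Because only the most recent parameter update is differentiated, the hypergradient costs little more than one extra gradient evaluation on a validation batch. In this toy run the penalty starts too strong (biasing `w` below the true slope of 2), and the T2 steps shrink `lam` until the validation fit improves.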
A significant advantage of the T1−T2 approach is its computational efficiency. Unlike previous methods that rely on inverse Hessian computations or gradient propagation through the entire history of parameter updates, T1−T2 employs a simplified approximation that circumvents these costly operations. This simplification enables the method to be applied effectively to modern deep learning models, which typically have millions of parameters.
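The simplification can be stated compactly. Writing (as an assumed notation for illustration) $C_1$ for the training cost, $C_2$ for the validation cost, $\theta$ for the parameters, $\lambda$ for the hyperparameters, and $\eta$ for the learning rate, the greedy one-step hypergradient is:

```latex
\theta_{t+1} = \theta_t - \eta \,\nabla_\theta C_1(\theta_t, \lambda)
\quad\Longrightarrow\quad
\frac{\partial C_2(\theta_{t+1})}{\partial \lambda}
\approx \nabla_\theta C_2(\theta_{t+1})^{\top}
\frac{\partial \theta_{t+1}}{\partial \lambda}
= -\eta\, \nabla_\theta C_2(\theta_{t+1})^{\top}\,
\nabla_\lambda \nabla_\theta C_1(\theta_t, \lambda)
```

Only the dependence of the most recent update on $\lambda$ is retained; the dependence carried through earlier updates, which would require unrolling the whole training history or an inverse Hessian, is dropped.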
Experimental Results
The authors conduct extensive experiments with various neural network architectures, including MLPs and CNNs, across datasets such as MNIST, SVHN, and CIFAR-10. Regularization techniques explored include Gaussian noise applied to inputs and hidden layers, as well as L2 weight penalties. The method consistently identified hyperparameters within optimal ranges, even from different initial values, suggesting robustness in steering training towards regions of lower validation cost (higher log-likelihood).
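For such tuning to work, each regularizer must depend on its hyperparameter through a continuous, differentiable function. A minimal sketch of one common parameterization (log-space, an assumption here rather than a detail taken from the paper) that keeps an L2 strength and a noise standard deviation positive under gradient updates:

```python
import numpy as np

# Tune unconstrained log-values so gradient steps cannot push the
# hyperparameters negative; exponentiate wherever the quantity is used.
rho_l2 = np.log(1e-3)    # unconstrained variable for the L2 strength
rho_noise = np.log(0.1)  # unconstrained variable for the input-noise std

def l2_penalty(w):
    lam = np.exp(rho_l2)      # lambda > 0 by construction
    return lam * np.sum(w ** 2)

def noisy_input(x, rng):
    sigma = np.exp(rho_noise) # sigma > 0 by construction
    return x + rng.normal(scale=sigma, size=x.shape)

# Chain rule: d(lam)/d(rho) = exp(rho) = lam, so a hypergradient w.r.t.
# lam converts to one w.r.t. rho by multiplying by lam itself.
```

This reparameterization also makes the effective step size scale with the hyperparameter's magnitude, which is convenient when good values span several orders of magnitude.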
A striking observation from the experiments is the lack of overfitting to validation sets, even with numerous hyperparameters. This indicates that validation performance remains indicative of generalization to unseen test sets. Additionally, T1−T2 demonstrates utility in cases where practitioners lack strong intuitions about initial hyperparameter values, providing an effective baseline for further fine-tuning.
Implications for Future Research
The T1−T2 method suggests promising directions for advancing hyperparameter optimization practices. In particular, it broadens the range of hyperparameters that can feasibly be tuned, provided they admit continuous representations. The approach could extend beyond regularization hyperparameters to areas such as dynamically adjusting network architectures during training. The paper also opens avenues for exploring complex setups in which hyperparameter values dynamically influence other network characteristics, such as the contributions of individual layers.
In conclusion, the paper proposes a practical, efficient technique for gradient-based hyperparameter tuning, substantially reducing computational overhead while maintaining reliable results. The method could serve as a foundational tool for further research into adaptive training methodologies, shaping future developments in neural network optimization.