- The paper introduces TaylorGLO, a metalearning method that parameterizes loss functions using multivariate Taylor expansions and optimizes them with CMA-ES.
- It demonstrates that the evolved loss functions outperform standard cross-entropy in accuracy and robustness, while the search itself requires far fewer candidate evaluations than prior loss-function metalearning methods.
- Experiments on MNIST, CIFAR-10, and SVHN validate the approach, showing smooth, well-behaved loss landscapes that improve model regularization and generalization.
This paper, "Optimizing Loss Functions Through Multivariate Taylor Polynomial Parameterization" (2020), introduces TaylorGLO, a novel metalearning method for automatically discovering effective loss functions for deep neural networks. The core idea is to parameterize loss functions using multivariate Taylor expansions and optimize these parameters using continuous optimization techniques.
Problem: Manual design and tuning of loss functions are challenging and time-consuming. While metalearning techniques exist for architecture and hyperparameter search, loss function optimization is a relatively new area. Previous approaches, like Genetic Loss Optimization (GLO) [gonzalez2019glo], used genetic programming on tree-based representations, which suffered from discrete search spaces, leading to discontinuities, unstable functions, and inefficient search requiring many evaluations. There was a need for a more continuous and well-behaved parameterization for loss functions to enable more efficient optimization.
Proposed Solution (TaylorGLO): TaylorGLO addresses this by representing loss functions as multivariate Taylor expansions. A $k$th-order Taylor expansion of a function with $n$ inputs can be parameterized by a vector $\boldsymbol{\theta}$ containing the coefficients corresponding to the partial derivatives at a chosen center point, along with the coordinates of the center point itself. For a classification loss function involving network output $y_i$ and true label $x_i$, the function $f(x_i, y_i)$ within the standard $\mathcal{L} = -\frac{1}{N}\sum_i f(x_i, y_i)$ structure is replaced by its Taylor approximation. The coefficients of this Taylor polynomial become the parameters that are optimized. To ensure the loss function is useful and depends on the prediction, terms that do not contribute to the gradient with respect to the output $y_i$ are trimmed from the expansion.
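To make the parameterization concrete, below is a minimal NumPy sketch of a third-order ($k=3$), two-input Taylor loss built from such a parameter vector. The coefficient ordering, the factorial scaling, and the names (`taylor_loss`, `c_x`, `c_y`) are illustrative assumptions rather than the paper's exact formulation; only terms with a nonzero gradient with respect to $y_i$ are kept, matching the trimming described above.

```python
import numpy as np

def taylor_loss(theta, x, y):
    """Sketch of a third-order bivariate Taylor-expansion loss.

    theta : length-8 parameter vector (expansion center followed by the
            coefficients of the terms that survive trimming)
    x     : true labels (one-hot), shape (batch, classes)
    y     : network outputs (e.g., softmax), shape (batch, classes)
    """
    c_x, c_y = theta[0], theta[1]           # expansion center
    a1, a2, a3, b1, b2, b3 = theta[2:8]     # surviving coefficients
    dx, dy = x - c_x, y - c_y
    f = (a1 * dy + a2 * dy**2 / 2 + a3 * dy**3 / 6   # pure-y terms
         + b1 * dx * dy                               # mixed x/y terms
         + b2 * dx * dy**2 / 2
         + b3 * dx**2 * dy / 2)
    # Standard GLO/TaylorGLO structure: L = -(1/N) * sum_i f(x_i, y_i)
    return -np.mean(np.sum(f, axis=-1))
```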
Because this parameterization results in a continuous search space, TaylorGLO leverages the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [hansen1996cmaes], a powerful black-box continuous optimization algorithm. CMA-ES maintains a distribution over the parameter space and iteratively adapts it based on the performance (fitness) of sampled candidate loss functions. The fitness of a candidate loss function is determined by training a neural network with that loss function for a limited number of steps and evaluating its accuracy on a validation dataset. The search starts from a zero vector for the parameters, representing an initially unbiased function.
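A minimal sketch of this outer loop is shown below, using the open-source pycma package rather than the paper's Swift implementation. The step size, population size, and the toy surrogate inside `evaluate_candidate` are illustrative assumptions; in TaylorGLO the fitness would be the validation accuracy of a partially trained network (a fuller evaluation sketch appears under Implementation Details below).

```python
import numpy as np
import cma  # pip install cma; the paper used a Swift CMA-ES implementation

def evaluate_candidate(theta):
    """Stand-in fitness for illustration only. In TaylorGLO this would be
    the validation accuracy of a network trained for a few thousand steps
    with the candidate Taylor loss."""
    return -float(np.sum((np.asarray(theta) - 0.3) ** 2))  # toy surrogate

es = cma.CMAEvolutionStrategy(
    x0=np.zeros(8),              # unbiased all-zero starting parameter vector
    sigma0=1.0,                  # initial step size (illustrative value)
    inopts={"popsize": 20})      # λ = 20, as used for CIFAR-10/SVHN

while not es.stop():
    candidates = es.ask()        # sample λ candidate loss-function parameters
    # CMA-ES minimizes, so negate the (higher-is-better) fitness
    es.tell(candidates, [-evaluate_candidate(c) for c in candidates])

best_theta = es.result.xbest     # best loss-function parameters found
```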
Benefits of Taylor Parameterization: Using multivariate Taylor expansions for loss function parameterization offers several advantages:
- Smoothness: The resulting loss functions are guaranteed to be smooth.
- No Poles: Polynomials do not have discontinuities (poles) within their domain, making the resulting loss functions well-behaved.
- Simple Implementation: They can be implemented using only addition and multiplication.
- Trivial Differentiation: Gradients are easily calculated, which is essential for training neural networks (a short gradient sketch follows this list).
- Smooth Search Space: Nearby parameter vectors correspond to similar loss functions, making the optimization landscape easier to search.
- Efficiency and Reliability: Valid and useful loss functions are found more frequently and in fewer generations compared to tree-based methods.
- Tunable Complexity: The complexity of the loss function can be controlled by the order of the Taylor expansion (k).
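As a concrete instance of the trivial-differentiation point, the gradient of the Taylor loss sketched earlier with respect to the network output is itself just a lower-order polynomial. The sketch below differentiates it by hand; names and term ordering follow that earlier illustrative sketch, not the paper.

```python
def taylor_loss_grad_y(theta, x, y):
    """Analytic gradient of the sketched taylor_loss w.r.t. the network
    output y. Differentiating a polynomial only lowers powers, so the
    gradient is again a simple polynomial in (x - c_x) and (y - c_y)."""
    c_x, c_y = theta[0], theta[1]
    a1, a2, a3, b1, b2, b3 = theta[2:8]
    dx, dy = x - c_x, y - c_y
    df_dy = (a1 + a2 * dy + a3 * dy**2 / 2
             + b1 * dx + b2 * dx * dy + b3 * dx**2 / 2)
    # d/dy of -(1/N) * sum f distributes -1/N over every element
    return -df_dy / y.shape[0]
```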
Implementation Details:
- Datasets: Experiments were conducted on MNIST, CIFAR-10, and SVHN datasets.
- Models: Various standard CNN architectures were used, including a Basic CNN, AlexNet [NIPS2012_4824], AllCNN-C [allcnn], Preactivation ResNet-20 [preresnet], and Wide ResNets [wideresnet]. Some experiments also included data augmentation with Cutout [cutout].
- TaylorGLO Setup: CMA-ES was configured with specific population sizes (λ=28 for MNIST, λ=20 for CIFAR-10/SVHN) and an initial step size. A third-order (k=3) Taylor expansion was primarily used, found to offer a good balance between performance and evolution time.
- Candidate Evaluation: Neural networks were trained with candidate loss functions for a partial duration (e.g., 2,000 steps for MNIST, which is 10% of a full training run). This reduces the time required for each fitness evaluation. Experiments showed that this partial evaluation is sufficient and even more sample-efficient than evaluating with full training runs (a fitness-evaluation sketch follows this list).
- Computational Setup: Training was distributed across a GPU cluster using TensorFlow [tensorflow], while the CMA-ES optimization ran centrally using Swift.
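A minimal sketch of such a partial-training fitness evaluation with tf.keras is shown below. The `build_model` factory, optimizer choice, dataset handling, and default step budget are illustrative assumptions; the paper's actual pipeline distributes these runs across a GPU cluster.

```python
import tensorflow as tf

def evaluate_candidate(theta, build_model, train_ds, val_ds, steps=2000):
    """Fitness of one candidate: train a fresh model for a small number of
    steps with the Taylor-parameterized loss, then return validation accuracy.
    train_ds is assumed to be a repeating tf.data.Dataset of (x, y) batches."""
    model = build_model()                       # fresh network per candidate

    def candidate_loss(y_true, y_pred):
        c_x, c_y = theta[0], theta[1]
        a1, a2, a3, b1, b2, b3 = theta[2:8]
        dx, dy = y_true - c_x, y_pred - c_y
        f = (a1 * dy + a2 * dy**2 / 2 + a3 * dy**3 / 6
             + b1 * dx * dy + b2 * dx * dy**2 / 2 + b3 * dx**2 * dy / 2)
        return -tf.reduce_sum(f, axis=-1)       # Keras averages over the batch

    model.compile(optimizer="sgd", loss=candidate_loss, metrics=["accuracy"])
    model.fit(train_ds, steps_per_epoch=steps, epochs=1, verbose=0)
    _, val_acc = model.evaluate(val_ds, verbose=0)
    return val_acc
```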
Results:
- Discovery Process: TaylorGLO efficiently discovers high-performing loss functions, often converging within tens of generations (e.g., within 20 generations on MNIST). The population of candidate functions converges over time in the parameter space (Figure 7), leading to more consistent high-performing functions.
- Evolved Loss Function Shape: The evolved loss functions exhibit surprising shapes, particularly on MNIST (Figure 3). Instead of decreasing monotonically like cross-entropy, the evolved functions increase again as the prediction for the correct class approaches full confidence. This counter-intuitive shape is hypothesized to provide an implicit regularization effect by discouraging the model from outputting overly confident, extreme values, thus preventing overfitting (a toy numerical illustration follows this list).
- Performance: TaylorGLO-discovered loss functions consistently outperform the standard cross-entropy loss across various datasets and architectures (Table 1). For example, on MNIST, the best TaylorGLO loss achieved 0.9951 mean test accuracy, significantly better than cross-entropy's 0.9899 and slightly better than the previous GLO best (BaikalCMA) at 0.9947. Crucially, TaylorGLO achieved this superior performance with significantly fewer partial training evaluations (e.g., 448 evaluations for the top MNIST function) compared to GLO (over 11,000 evaluations).
- Regularization and Robustness: Evolved loss functions improve performance even when combined with other regularization techniques like Cutout, suggesting a distinct mechanism. Analysis of the loss landscape (Figure 5) shows that models trained with TaylorGLO losses have flatter, lower accuracy basins, indicating greater robustness and better generalization compared to cross-entropy.
- Reduced Datasets: The performance gains from TaylorGLO loss functions are particularly pronounced on reduced datasets (Figure 6), further supporting the hypothesis that they provide effective regularization against overfitting when data is limited.
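To make that over-confidence argument tangible, the toy comparison below contrasts cross-entropy with a loss whose minimum sits short of full confidence. The coefficients are purely illustrative, not the evolved ones: cross-entropy keeps rewarding pushing the correct-class probability toward 1, whereas the toy loss starts penalizing predictions beyond roughly 0.9.

```python
import numpy as np

p = np.linspace(0.05, 0.999, 5)       # probability assigned to the true class
cross_entropy = -np.log(p)            # monotonically decreasing in p
toy_evolved = (p - 0.9) ** 2          # minimum at p = 0.9, rises again toward 1

for pi, ce, te in zip(p, cross_entropy, toy_evolved):
    print(f"p={pi:.3f}  cross-entropy={ce:.3f}  toy loss={te:.3f}")
```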
Discussion and Future Work: The success of TaylorGLO highlights the potential of optimizing loss functions as a form of metalearning. The surprising shapes of the evolved functions suggest that automatic discovery can yield non-obvious solutions that outperform human-designed ones. Future work could explore co-evolving loss functions and network architectures, incorporating state information (e.g., training progress, batch statistics, gradient information) into the loss function definition to make it dynamic, or applying the technique to different domains and problem types.
Conclusion: TaylorGLO demonstrates a promising new direction for metalearning by using a continuous parameterization of loss functions via multivariate Taylor expansions and optimizing them with CMA-ES. This approach is more efficient and reliable than previous discrete methods. The discovered loss functions provide a distinct form of regularization that improves performance, especially on reduced datasets, and results in more robust models, making TaylorGLO a valuable tool for automating machine learning model development.