Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks (2006.07322v5)

Published 12 Jun 2020 in cs.LG and stat.ML

Abstract: Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. Furthermore, training with square loss appears to be less sensitive to the randomness in initialization. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.

Citations (153)

Summary

  • The paper demonstrates that square loss achieves comparable or superior performance to cross-entropy in 22 out of 28 classification tasks.
  • It shows that removing the softmax layer and tuning the learning rate and loss scaling are key to optimizing square loss performance.
  • The study highlights square loss's lower sensitivity to initialization, making it a robust alternative for training deep neural networks.

The paper, "Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks" (2006.07322), presents a systematic empirical comparison of training deep neural networks for classification using the square loss (also known as the L2 loss or Brier score) versus the widely adopted cross-entropy loss. The core premise is to challenge the conventional belief that cross-entropy is empirically superior for classification, and the authors provide extensive evidence that the square loss is competitive or even better in many scenarios.

The authors evaluated ten modern neural architectures across ten standard benchmark datasets spanning NLP, Automatic Speech Recognition (ASR), and computer vision. For each task, they trained the specified architecture with both the square loss and cross-entropy, largely keeping the hyperparameter settings reported in the literature (which were tuned for cross-entropy), with specific adjustments noted for square-loss training.

Key findings from the empirical study include:

  • Performance: Square loss achieved comparable or better performance (measured by accuracy, F1 score, or error rate) in 22 out of 28 tested tasks. The advantage was particularly pronounced in NLP and ASR tasks. Cross-entropy showed a slight edge mainly in certain computer vision tasks, notably ImageNet with the EfficientNet architecture.
  • Computational Resources: When the square loss was trained for the same number of epochs as cross-entropy (equalizing computational cost), performance remained competitive, often matching or exceeding the cross-entropy results.
  • Sensitivity to Initialization: Training with square loss demonstrated lower variance across different random initializations in the majority of experiments (21 out of 28 tasks), suggesting it can be less sensitive to the randomness inherent in the training process.
  • Softmax Layer: A crucial practical observation is that the final softmax layer, which is standard when training with cross-entropy, should be removed when training with the square loss; including softmax significantly impeded optimization in their experiments.
  • Loss Rescaling for Many Classes: For datasets with a large number of output classes (≥ 42), the authors found that a simple rescaling of the square loss helped accelerate training and improve performance. The rescaled square loss is defined as:

    l_s = \frac{1}{C}\left(k\,(f_c(x) - M)^2 + \sum_{i=1,\, i \neq c}^{C} f_i(x)^2\right),

    where f(x) is the network output vector, c is the true class index (the index of the 1 in the one-hot label y), C is the number of classes, k rescales the loss term for the true class output, and M rescales the target value for the true class (from 1 to M). For datasets with ≥ 42 classes, they used k = 1 and M = 15, or k = 15 and M = 30. For datasets with fewer classes, k = M = 1, which reverts to the standard square loss. This rescaling is similar in spirit to proposals for improving the optimization landscape of the square loss in multiclass settings. A minimal implementation sketch of this rescaled loss is given after this list.

  • Learning Rate Adjustment: While other hyperparameters were mostly kept the same, the learning rate for square loss often needed to be adjusted (typically increased) compared to the value used for cross-entropy.
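
To make the rescaled loss concrete, here is a minimal PyTorch-style sketch, assuming raw logits (no softmax) and integer class labels. The function name rescaled_square_loss and its defaults are illustrative, not the authors' code; with k = M = 1 it reduces to the plain square loss against one-hot targets.

```python
import torch
import torch.nn.functional as F

def rescaled_square_loss(logits: torch.Tensor, targets: torch.Tensor,
                         k: float = 1.0, M: float = 1.0) -> torch.Tensor:
    """Rescaled square loss: (1/C) * (k*(f_c - M)^2 + sum_{i != c} f_i^2),
    averaged over the batch. With k = M = 1 this is the standard square
    loss against one-hot targets.

    logits:  raw network outputs of shape (batch, C); no softmax applied
    targets: integer class indices of shape (batch,)
    """
    C = logits.size(1)
    one_hot = F.one_hot(targets, num_classes=C).to(logits.dtype)
    # Squared error against the target vector (M on the true class, 0 elsewhere).
    sq_err = (logits - M * one_hot) ** 2
    # Weight the true-class term by k, all other terms by 1.
    weights = 1.0 + (k - 1.0) * one_hot
    return (weights * sq_err).sum(dim=1).mean() / C
```

For the many-class regime, the settings reported above (e.g. k = 1, M = 15) can be passed directly; for few-class tasks the defaults recover the ordinary square loss.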

The paper argues that the historical preference for cross-entropy over square loss in classification might stem from outdated theoretical arguments (like surrogate loss properties in simpler models) or intuitions that don't fully apply to modern over-parameterized deep networks. They also question the reliability of probability outputs from cross-entropy trained networks, noting that calibration issues are common.

In terms of practical implementation, developers looking to apply these findings should consider the following steps (a minimal end-to-end sketch follows the list):

  1. Replacing Cross-Entropy: Replace the cross-entropy loss function with the mean squared error (square) loss computed against one-hot targets.
  2. Removing Softmax: Ensure the final softmax activation layer before the loss calculation is removed.
  3. Learning Rate Tuning: Be prepared to tune the learning rate, as the optimal value for square loss may differ from cross-entropy.
  4. Loss Rescaling: For tasks with a large number of classes, implement the proposed simple rescaling mechanism with parameters k and M to potentially improve training speed and final performance. The specific values suggested in the paper can serve as a starting point.
  5. Hyperparameter Re-evaluation: While the paper used hyperparameters tuned for cross-entropy, further performance gains might be possible by optimizing other hyperparameters specifically for the square loss.
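
As a minimal end-to-end illustration of steps 1–3 (a sketch under assumptions, not the authors' training code), the following PyTorch snippet trains on raw logits with a mean-squared-error loss against one-hot targets; the toy model, dimensions, and learning rate are placeholders that would need retuning per task. For many-class tasks, the rescaled_square_loss sketched earlier would replace the plain MSE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10                 # illustrative; set to the task's class count
batch_size, feat_dim = 32, 512   # dummy dimensions for this sketch

# Step 2: the classifier ends in raw logits -- no final softmax layer.
model = nn.Sequential(
    nn.Linear(feat_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_classes),
)

# Step 3: the learning rate usually needs retuning (often upward) relative to
# the value used with cross-entropy; 0.1 here is only a placeholder.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Dummy batch standing in for a real data loader.
features = torch.randn(batch_size, feat_dim)
labels = torch.randint(0, num_classes, (batch_size,))

# Step 1: square loss against one-hot targets instead of cross-entropy.
logits = model(features)                                   # (batch, num_classes)
targets = F.one_hot(labels, num_classes).to(logits.dtype)  # one-hot encoding
loss = F.mse_loss(logits, targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```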

The paper concludes by suggesting that training with square loss should be considered a standard best practice for deep learning classification tasks, on equal footing with, or potentially preferred over, cross-entropy, particularly in domains like NLP and ASR.
