
Deep Learning using Linear Support Vector Machines (1306.0239v4)

Published 2 Jun 2013 in cs.LG and stat.ML

Abstract: Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that by simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.

Citations (876)

Summary

  • The paper demonstrates that replacing the softmax layer with a linear SVM consistently improves classification performance, achieving 0.87% error on MNIST and 11.9% on CIFAR-10.
  • It employs an L2-SVM with a margin-based loss and SGD optimization to effectively backpropagate gradients through deep network architectures.
  • Experimental results, including a private test accuracy of 71.2% on the ICML 2013 facial expression recognition challenge, further support the SVM layer as a strong alternative to softmax.

Deep Learning using Linear Support Vector Machines

The paper "Deep Learning using Linear Support Vector Machines" by Yichuan Tang presents an approach that integrates linear Support Vector Machines (SVMs) with deep learning architectures, aiming to improve the performance of classification tasks. This paper investigates the efficacy of substituting the traditional softmax activation function with a linear SVM layer, applying a margin-based loss function instead of the conventional cross-entropy loss.

Key Contributions

The primary contribution of this paper is the demonstration of a consistent performance improvement obtained by replacing the softmax layer with a linear SVM as the final classification layer of a neural network. The substitution is tested on the well-known MNIST and CIFAR-10 benchmarks and on the facial expression recognition challenge of the ICML 2013 Representation Learning Workshop, showing significant performance gains on each.

Methodology

  1. Softmax vs. Linear SVM:
    • The paper offers a clear comparison between softmax layers and linear SVMs. Softmax models a probability distribution over classes and is trained by minimizing cross-entropy loss; an SVM instead maximizes the margin between data points of different classes.
    • The authors specifically employ the L2-SVM, which minimizes the squared hinge loss. This objective is differentiable, so its gradients can be backpropagated through the lower layers of the network (a minimal loss sketch follows this list).
  2. Implementation Details:
    • The methodology replaces the softmax layer with a linear SVM top layer and optimizes the network with standard backpropagation: gradients of the SVM objective are propagated through all lower layers.
    • Training uses stochastic gradient descent (SGD) with momentum, minibatches, and careful hyperparameter tuning (a training-loop sketch also follows this list).
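
The contrast between the two objectives can be made concrete with a small sketch. The following NumPy code is not the authors' implementation; the function names and the C value are illustrative. It computes the squared hinge (L2-SVM) loss with one-vs-rest targets in {-1, +1}, as in the paper, together with its gradient with respect to the top-layer activations, which is exactly what gets backpropagated; softmax cross-entropy is included for comparison. The 0.5·||w||² term of the SVM objective would be handled as ordinary weight decay on the top-layer weights.

```python
import numpy as np

def l2svm_loss_and_grad(scores, labels, C=1.0):
    """Squared hinge (L2-SVM) loss over one-vs-rest targets in {-1, +1}.

    scores : (N, K) activations of the top linear layer
    labels : (N,) integer class labels
    Returns the mean loss and its gradient w.r.t. `scores`, which is
    the quantity backpropagated into the rest of the network.
    """
    N, K = scores.shape
    targets = -np.ones((N, K))
    targets[np.arange(N), labels] = 1.0            # one-vs-rest coding
    margins = np.maximum(0.0, 1.0 - targets * scores)
    loss = C * np.sum(margins ** 2) / N
    grad = -2.0 * C * targets * margins / N        # d loss / d scores
    return loss, grad

def softmax_xent_loss_and_grad(scores, labels):
    """Standard softmax cross-entropy, shown for comparison."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    N = scores.shape[0]
    loss = -np.log(probs[np.arange(N), labels]).mean()
    grad = probs.copy()
    grad[np.arange(N), labels] -= 1.0
    return loss, grad / N
```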
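
Where the first sketch isolates the loss, the next one shows how the swap fits into a training loop. This is a minimal PyTorch sketch, not the paper's code: the network sizes, learning rate, momentum, and weight decay values are assumptions chosen for illustration. The top linear layer plays the role of the linear SVM, the margin-based criterion replaces cross-entropy, and SGD with momentum drives the gradients through the whole network, mirroring the setup described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def l2svm_criterion(scores, labels, C=1.0):
    # One-vs-rest targets in {-1, +1}, squared hinge (L2-SVM) loss.
    targets = 2.0 * F.one_hot(labels, scores.size(1)).float() - 1.0
    margins = torch.clamp(1.0 - targets * scores, min=0.0)
    return C * margins.pow(2).sum(dim=1).mean()

# Hypothetical MNIST-sized network; the paper uses both fully connected
# and convolutional nets. The final nn.Linear acts as the linear SVM.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(),
                      nn.Linear(512, 10))
# weight_decay provides L2 regularization, standing in for the SVM's
# ||w||^2 term (applied here to all layers, not just the top one).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

def train_step(images, labels):
    optimizer.zero_grad()
    scores = model(images)                  # raw activations, no softmax
    loss = l2svm_criterion(scores, labels)  # margin loss replaces cross-entropy
    loss.backward()                         # gradients flow through all layers
    optimizer.step()
    return loss.item()
```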

Experimental Results

The paper presents a compelling comparison of softmax versus linear SVM in deep learning models:

  • Facial Expression Recognition:
    • The model with a linear SVM top layer achieved a private test accuracy of 71.2%, winning the facial expression recognition challenge of the ICML 2013 Representation Learning Workshop and surpassing the second-place team by nearly 2%.
  • MNIST:
    • On MNIST, the L2-SVM model achieved an error rate of 0.87%, compared with 0.99% for the otherwise equivalent softmax model.
  • CIFAR-10:
    • On CIFAR-10, the linear SVM model achieved a test error of 11.9%, significantly lower than the 14.0% test error of the model with softmax.

Implications and Future Work

Combining SVMs with deep learning architectures offers a potentially more robust and better-regularized approach to classification. The empirical evidence presented in this paper suggests that the superior regularization effect of the margin-based SVM loss leads to better generalization than training against the traditional softmax cross-entropy.

Future work can explore:

  • Alternative Multiclass SVM Formulations:
    • Investigating different formulations and implementations of multiclass SVMs to further optimize performance.
  • Comprehensive Hyperparameter Tuning:
    • Fine-tuning other hyperparameters such as dropout rates, initialization strategies, and learning schedules.
  • Extension to Other Network Architectures:
    • Applying the SVM approach to other architectures, such as recurrent neural networks (RNNs) and more complex convolutional networks, to study its effectiveness across a broader range of tasks.

In summary, this paper presents a valuable exploration into integrating SVMs with deep neural networks, providing a detailed comparative analysis that underscores the benefits of such an approach. The research highlights the potential for SVMs to serve as a robust alternative to softmax layers in neural network-based classification tasks.