mixup: Beyond Empirical Risk Minimization (1710.09412v2)

Published 25 Oct 2017 in cs.LG and stat.ML

Abstract: Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. In this work, we propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

Citations (8,974)


Summary

The paper introduces mixup, a novel augmentation method that creates virtual samples by linearly interpolating inputs and labels to improve generalization.
It mitigates overfitting by encouraging smoother decision boundaries and demonstrates significant improvements on datasets like CIFAR and ImageNet.
Mixup is easy to implement with minimal computational overhead and enhances robustness against adversarial attacks and label noise.

The paper "mixup: Beyond Empirical Risk Minimization" (2017) introduces mixup, a simple yet powerful data augmentation technique designed to improve the generalization and robustness of neural networks that would otherwise be trained with the standard Empirical Risk Minimization (ERM) principle.

The Problem:

Deep learning models trained via ERM minimize the average error on the training data. While effective, large neural networks trained this way tend to memorize the training examples, including any noise or corruptions. As a result, they can generalize poorly to data that differs even slightly from the training set, which shows up as vulnerability to adversarial examples and sensitivity to small distribution shifts. Standard data augmentation, a form of Vicinal Risk Minimization (VRM), improves generalization by creating virtual examples in the "vicinity" of the training data, but it usually requires domain-specific knowledge (e.g., image rotations, speech noise injection).

The mixup Solution:

Mixup is proposed as a data-agnostic VRM principle. Instead of relying on domain-specific transformations, mixup generates virtual training examples by taking convex combinations of pairs of examples and their labels from the training data. For two training examples $(x_i, y_i)$ and $(x_j, y_j)$, the mixup process creates a new virtual example $(\tilde{x}, \tilde{y})$ as follows:

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$

$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$

where $x_i, x_j$ are raw input feature vectors and $y_i, y_j$ are their corresponding one-hot encoded labels. The interpolation coefficient $\lambda \in [0, 1]$ is sampled from a Beta distribution $\text{Beta}(\alpha, \alpha)$ for a given hyperparameter $\alpha > 0$. As $\alpha \to 0$, $\lambda$ concentrates at 0 or 1, so each virtual example collapses to one of the two originals and the ERM principle is recovered. A typical choice is $\alpha=1$, which makes $\lambda$ uniformly distributed between 0 and 1.
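To make the role of $\alpha$ concrete, the following minimal sketch estimates how often $\lambda$ lands close to 0 or 1 for a few values of $\alpha$. The helper name `endpoint_fraction`, the 0.05 threshold, and the sample size are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def endpoint_fraction(alpha, n=100_000, eps=0.05):
    """Fraction of lambda ~ Beta(alpha, alpha) samples within eps of 0 or 1."""
    lam = rng.beta(alpha, alpha, size=n)
    return float(np.mean((lam < eps) | (lam > 1 - eps)))

# Small alpha pushes lambda toward the endpoints (close to plain ERM);
# large alpha concentrates lambda around 0.5 (strong mixing).
for alpha in (0.1, 0.4, 1.0, 8.0):
    share = endpoint_fraction(alpha)
    print(f"alpha={alpha:>4}: share of lambda near 0 or 1 = {share:.3f}")
```

For $\alpha=1$ the share should be close to 0.10, since a uniform $\lambda$ falls in the two 0.05-wide end intervals about 10% of the time. Smaller $\alpha$ gives a larger share and larger $\alpha$ a smaller one.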

The core idea is that training on these interpolated examples encourages the model to behave linearly in the space between training data points. This linear interpolation of features and targets acts as a regularizer, discouraging complex, potentially overfitted decision boundaries and promoting smoother transitions between classes.

Practical Implementation:

Implementing mixup during training is straightforward and adds minimal computational overhead. For each training batch, you sample pairs of examples (e.g., by shuffling the batch and pairing original with shuffled samples). For each pair $(x_i, y_i)$ and $(x_j, y_j)$, you sample a $\lambda$ from $\text{Beta}(\alpha, \alpha)$ and compute the mixed input $\tilde{x}$ and mixed target $\tilde{y}$. The model is then trained using $(\tilde{x}, \tilde{y})$ with the standard loss function (e.g., cross-entropy).

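Here is a PyTorch snippet based on the paper's example. It assumes x is a batch of inputs with shape (batch_size, ...) and y holds the matching one-hot labels as a float tensor of shape (batch_size, num_classes):

import numpy as np
import torch

alpha = 1.0  # Beta(alpha, alpha) concentration hyperparameter
batch_size = x.size(0)

# Sample a single interpolation coefficient for the whole batch
lam = float(np.random.beta(alpha, alpha))

# Random permutation that pairs sample i with sample index[i]
index = torch.randperm(batch_size, device=x.device)

# Convex combinations of the inputs and of the one-hot labels
mixed_x = lam * x + (1 - lam) * x[index]
mixed_y = lam * y + (1 - lam) * y[index]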

This implementation samples one $\lambda$ per batch and mixes each sample x[i] with x[index[i]]. Because index is a random permutation, each sample is paired with a randomly chosen sample from the same batch (occasionally itself). The paper notes that this single-batch shuffle strategy works as well as drawing the two samples from separate data loaders, while reducing I/O requirements.
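Because cross-entropy is linear in the target, the loss on the mixed one-hot labels equals the same convex combination of the two per-label losses. Many implementations rely on this to keep integer class labels instead of building one-hot targets. The NumPy sketch below checks the identity numerically. The random logits stand in for model outputs on the mixed inputs, and the helper names `log_softmax` and `cross_entropy` are assumptions made for this illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, soft_targets):
    """Mean cross-entropy between logits and (possibly soft) target rows."""
    return float(-(soft_targets * log_softmax(logits)).sum(axis=1).mean())

rng = np.random.default_rng(0)
num_classes, batch_size = 5, 8

# Stand-ins for the model's outputs on mixed_x and for the batch labels
logits = rng.normal(size=(batch_size, num_classes))
labels = rng.integers(0, num_classes, size=batch_size)
index = rng.permutation(batch_size)
lam = rng.beta(1.0, 1.0)

one_hot = np.eye(num_classes)[labels]
mixed_y = lam * one_hot + (1 - lam) * one_hot[index]

# Loss on mixed targets vs. mixed loss on the two original targets
loss_soft = cross_entropy(logits, mixed_y)
loss_split = (lam * cross_entropy(logits, one_hot)
              + (1 - lam) * cross_entropy(logits, one_hot[index]))

print(np.isclose(loss_soft, loss_split))
```

The identity holds for any loss that is linear in the target. It would not hold in general for, say, a squared-error loss on the mixed labels.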

Real-world Applications and Results:

The paper demonstrates the effectiveness of mixup across various domains and tasks:

Image Classification (ImageNet, CIFAR): Mixup improves top-1 and top-5 error rates on large-scale datasets (ImageNet) and standard benchmarks (CIFAR-10, CIFAR-100) with state-of-the-art models like ResNet, ResNeXt, WideResNet, and DenseNet. The benefit is more pronounced for larger-capacity models and longer training times. Tuning $\alpha$ matters: values between 0.1 and 0.4 worked well for ImageNet, where larger values led to underfitting, while $\alpha=1$ worked well for CIFAR.
Speech Data (Google commands): Applied at the spectrogram level, mixup reduces classification error on the Google commands dataset, with larger gains on the higher-capacity VGG-11 model than on LeNet. A warm-up period of initial training without mixup can speed up initial convergence.
Robustness to Corrupted Labels: Mixup significantly improves robustness when training with noisy labels. It achieves lower test error than ERM and dropout, while also achieving lower training error with respect to the true labels, meaning it fits the underlying signal rather than memorizing the corrupted labels. Combining mixup with dropout yields even better results, which indicates the two are compatible. Higher $\alpha$ values (e.g., 8 or 32) were more effective at resisting label corruption.
Robustness to Adversarial Examples: Mixup makes models more robust to adversarial attacks such as FGSM and I-FGSM in both the white-box setting (the attacker has full knowledge of the model) and the black-box setting (the attacker has no access to the model), as shown on ImageNet with ResNet-101. This increased robustness comes without computationally expensive adversarial training or explicit gradient penalties.
Tabular Data (UCI): Mixup shows improved or comparable performance on classification tasks across several UCI datasets using simple feed-forward networks, demonstrating its applicability beyond image and speech data.
GAN Stabilization: Mixup can stabilize the training of Generative Adversarial Networks (GANs) by regularizing the discriminator. The discriminator's objective is modified to predict $\lambda$ for a mixed sample of a real image and a fake image generated by the generator: $\ell(d(\lambda x + (1 - \lambda) g(z)), \lambda)$. This smoothing effect helps prevent vanishing gradients for the generator and leads to more stable convergence, as illustrated on toy 2D data distributions.
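The GAN objective in the last item can be sketched as follows, assuming $\ell$ is binary cross-entropy and $d$ outputs the probability that its input is real. The logistic `toy_discriminator`, the 2D Gaussian stand-ins for real and generated samples, and all function names are hypothetical and only illustrate the shape of the computation.

```python
import numpy as np

def binary_cross_entropy(p, target, eps=1e-12):
    """Mean binary cross-entropy; target may be a soft value in [0, 1]."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(target * np.log(p) + (1 - target) * np.log(1 - p))))

def mixup_discriminator_loss(d, x_real, x_fake, lam):
    """Loss l(d(lam * x + (1 - lam) * g(z)), lam) for one batch."""
    x_mix = lam * x_real + (1 - lam) * x_fake
    return binary_cross_entropy(d(x_mix), lam)

rng = np.random.default_rng(0)
w = rng.normal(size=2)  # weights of a hypothetical logistic discriminator

def toy_discriminator(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x_real = rng.normal(loc=2.0, size=(16, 2))   # stand-in for real 2D data
x_fake = rng.normal(loc=-2.0, size=(16, 2))  # stand-in for generator output g(z)
lam = rng.beta(1.0, 1.0)

loss = mixup_discriminator_loss(toy_discriminator, x_real, x_fake, lam)
print(f"lambda={lam:.3f}, discriminator loss={loss:.3f}")
```

With $\lambda=1$ the mixed sample is purely real and the target is 1, and with $\lambda=0$ it is purely fake with target 0, so the usual discriminator targets appear as the two endpoints of this objective.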

Implementation Considerations and Ablation Studies:

Hyperparameter $\alpha$: Tuning $\alpha$ is crucial. A common starting point is $\alpha=1.0$. Values are typically in the range [0.1, 0.4] for large datasets like ImageNet and can be higher (e.g., 1.0, 8.0, 32.0) for datasets like CIFAR, especially with high label corruption. A larger $\alpha$ increases regularization strength.
Weight Decay: Mixup's regularization effect means that lower weight decay values might be optimal compared to standard ERM training. Ablation studies showed that a smaller weight decay ($10^{-4}$) worked better for mixup than a larger one ($5 \times 10^{-4}$), whereas the opposite was true for ERM.
Interpolation Type: Ablation studies confirmed that mixing raw inputs and corresponding one-hot labels is key. Interpolating latent representations or mixing inputs while using hard labels did not perform as well.
Pairing Strategy: Mixing random pairs of examples from all classes (AC + RP) was found to be more effective than mixing only within the same class (SC) or only with nearest neighbors (KNN).
Computational Cost: Mixup adds minimal overhead: sampling $\lambda$, linear interpolation, and potentially shuffling batch indices. The main cost remains the forward and backward passes of the network.
Applicability: While mixup is shown to be effective for image, speech, and tabular classification and for GAN training, applying it to structured prediction tasks like segmentation or object detection might require adapting the target interpolation logic.
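The pairing strategies compared in the ablation can be illustrated with a small sketch of how the permutation index might be built in each case. Both helper functions are hypothetical names introduced here, not code from the paper, and the nearest-neighbor variant is omitted.

```python
import numpy as np

def random_pairs_all_classes(labels, rng):
    """Pair each sample with a random sample drawn from the whole batch."""
    return rng.permutation(len(labels))

def random_pairs_same_class(labels, rng):
    """Pair each sample with a random sample that has the same label."""
    index = np.arange(len(labels))
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        index[members] = rng.permutation(members)
    return index

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0])

all_idx = random_pairs_all_classes(labels, rng)
same_idx = random_pairs_same_class(labels, rng)

# Same-class pairing never mixes labels, so the mixed target stays one-hot;
# all-class pairing produces soft targets whenever the paired labels differ.
print("all-class partner labels: ", labels[all_idx])
print("same-class partner labels:", labels[same_idx])
```

Either index can be used in place of the random permutation in the batch-level mixing step. Only the all-class version exposes the model to interpolations between different classes, which is the setting the ablation found most effective.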

Conclusion:

Mixup is a simple, effective, and computationally cheap data augmentation technique that improves generalization, increases robustness to label noise and adversarial attacks, and can stabilize GAN training. Its core principle, linear interpolation between pairs of training examples and their targets, encourages linear model behavior between training points and acts as a regularizer that complements existing methods like dropout. Because it is data-agnostic, it applies broadly across domains.
