
Adversarial Training Can Hurt Generalization (1906.06032v2)

Published 14 Jun 2019 in cs.LG and stat.ML

Abstract: While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data. Furthermore, since our construction is based on a convex learning problem, we rule out optimization concerns, thus laying bare a fundamental tension between robustness and generalization. Finally, we show that robust self-training mostly eliminates this tradeoff by leveraging unlabeled data.

Citations (227)

Summary

  • The paper reveals that adversarial training can hurt generalization on clean data despite improving robustness against adversarial attacks.
  • It introduces a convex ‘staircase’ problem to isolate the statistical issue: fitting the staircase structure required for robustness demands more data than learning a simpler, near-linear predictor.
  • The study further shows that leveraging additional unlabeled data with robust self-training effectively narrows the tradeoff and improves standard accuracy.

This paper, "Adversarial Training Can Hurt Generalization" (Adversarial Training Can Hurt Generalization, 2019), investigates a counter-intuitive phenomenon in deep learning: while adversarial training improves a model's robustness against adversarial attacks, it often leads to a decrease in standard accuracy on clean, unperturbed data. Previous explanations for this tradeoff included a fundamental conflict between standard and robust objectives (where the optimal predictor isn't robust) or limited model capacity. This paper demonstrates that this tradeoff can occur even when the optimal predictor is both standard-accurate and robust, suggesting the issue lies in generalization with finite data rather than inherent objective conflict or capacity limits.

The authors construct a simple, convex learning problem, dubbed the "staircase," to isolate this statistical generalization issue. The problem involves predicting a value $y$ from an input $x$, where the optimal predictor has a staircase-like structure, specifically $f^\star(x) = m \lfloor x \rceil$ for some slope $m$, with $\lfloor x \rceil$ denoting $x$ rounded to the nearest integer. The data distribution $P_x$ is concentrated on the integer points $\text{line} = \{0, 1, \dots, s-1\}$, with low probability assigned to points slightly perturbed from these integers ($\text{line}^c$). The invariance set $B(x)$ for any point $x$ is defined around the nearest integer $\lfloor x \rceil$ and contains $\lfloor x \rceil$, $\lfloor x \rceil + \epsilon$, and $\lfloor x \rceil - \epsilon$. The key property is that $f^\star$ is constant within $B(x)$ for every $x$, so the optimal predictor is robust according to the paper's definition.
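
To make this construction concrete, here is a minimal NumPy sketch of sampling from such a staircase distribution; the sampler and parameters (`s`, `m`, `eps`, and the off-grid mass `delta`) are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def sample_staircase(n, s=10, m=1.0, eps=0.1, delta=0.05, rng=None):
    """Sample n points from a staircase-style distribution.

    With probability 1 - delta a point lies exactly on an integer in
    {0, ..., s-1} (the set `line`); with probability delta it is shifted
    by +/- eps (`line^c`). Labels follow the optimal predictor
    f*(x) = m * round(x), which is constant on each invariance set B(x).
    """
    rng = rng or np.random.default_rng(0)
    base = rng.integers(0, s, size=n).astype(float)    # points on `line`
    off_grid = rng.random(n) < delta                   # low-probability perturbed points
    shift = rng.choice([-eps, eps], size=n)
    x = base + off_grid * shift
    y = m * np.rint(x)                                 # f*(x) = m * round(x)
    return x, y

x_train, y_train = sample_staircase(n=40)
```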

In this setup, standard training minimizes the empirical risk on the sampled points, while robust training minimizes the worst-case loss over the invariance set $B(x_i)$ for each training sample $(x_i, y_i)$. The hypothesis class used for the staircase problem is cubic B-splines, which is expressive enough to contain $f^\star$.
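
The two objectives can be written down compactly. The sketch below uses squared loss and a generic `predict` callable (in the paper's setup this would be the cubic-spline model); the loss choice and exact formulation here are assumptions for illustration.

```python
import numpy as np

def standard_loss(predict, x, y):
    """Empirical risk: average squared error on the sampled points themselves."""
    return np.mean((predict(x) - y) ** 2)

def robust_loss(predict, x, y, eps=0.1):
    """Worst-case loss over the invariance set B(x_i) for each sample.

    B(x) = {round(x), round(x) + eps, round(x) - eps}; the label is the same
    at all three points because f* is constant on B(x).
    """
    centers = np.rint(x)
    per_point = np.stack([(predict(centers + d) - y) ** 2
                          for d in (-eps, 0.0, eps)])
    return np.mean(per_point.max(axis=0))   # max over B(x_i), then average over i
```

Standard training minimizes `standard_loss`; robust (adversarial) training minimizes `robust_loss` over the same hypothesis class.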

The paper's core finding from the staircase simulation is that with a small number of training samples $n$, standard training tends to learn a simple predictor (close to a linear function) that fits the high-probability points in $\text{line}$. Since most test points are also in $\text{line}$, this simple predictor generalizes well to standard test data. Robust training, however, is forced to fit the low-probability points in $\text{line}^c$ by penalizing the worst-case loss over $B(x_i)$. To satisfy the robust objective for points in $\text{line}$, the model must learn a staircase structure locally around each integer point, which is a more complex function than a simple line. With few samples, there may not be enough data to learn this staircase structure accurately across the entire input space, leading to poorer generalization on standard test data than the simpler standard-trained model (Figure~\ref{fig:tradeoff}a, Figure~\ref{fig:tradeoff-small-sample}). As the number of samples increases, the training set covers $\text{line}$ more densely, allowing the robust model to learn the correct staircase structure; its standard test error then decreases and eventually drops below that of the standard model (Figure~\ref{fig:tradeoff-large-sample}).

This behavior in the controlled convex problem mirrors observations on real datasets like CIFAR-10. The authors subsample CIFAR-10 to simulate different sample sizes and train standard and adversarially trained Wide ResNet models. They observe that the gap in standard test accuracy between standard and adversarially trained models is larger with fewer samples and decreases as the number of samples increases (Figure~\ref{fig:tradeoff}b). This supports the hypothesis that insufficient data is a significant factor in the standard accuracy drop seen with adversarial training in practice.
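
A sketch of the subsampling setup in PyTorch/torchvision is shown below; the sample sizes, augmentation, and training details here are placeholders and not taken from the paper.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

def subsampled_cifar10_loader(n_samples, batch_size=128, seed=0):
    """Build a training loader over a random subset of CIFAR-10 of size n_samples."""
    train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor())
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(len(train_set), generator=g)[:n_samples]
    return DataLoader(Subset(train_set, idx.tolist()),
                      batch_size=batch_size, shuffle=True)

# Compare standard vs. adversarially trained models at several sample sizes,
# then measure the gap in standard test accuracy as n grows.
for n in (1000, 5000, 25000, 50000):
    loader = subsampled_cifar10_loader(n)
    # ... train one standard and one adversarially trained model on `loader` ...
```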

A key practical implication is that this tradeoff is largely a data problem: if the standard accuracy drop arises because the robust model requires more data to generalize, then providing more data should mitigate the issue. The paper explores robust self-training (RST) as a way to leverage additional unlabeled data (a minimal sketch follows the list below). RST involves:

  1. Training a standard model on the available labeled data.
  2. Using this standard model to generate pseudo-labels for a pool of unlabeled data.
  3. Performing robust training on the combined labeled data and pseudo-labeled unlabeled data.
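
A compact sketch of this pipeline (PyTorch-style; the `attack` callable, the in-memory pseudo-labeling, and the loop details are illustrative assumptions, not the authors' released implementation):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def robust_self_training(standard_model, robust_model, labeled_set,
                         unlabeled_x, attack, optimizer, epochs=10):
    """Robust self-training (RST), mirroring the three steps above.

    `standard_model` is assumed to be already trained on the labeled data
    (step 1). `attack(model, x, y)` should return perturbed inputs, e.g. a
    PGD attack.
    """
    # Step 2: pseudo-label the unlabeled pool with the standard model.
    standard_model.eval()
    with torch.no_grad():
        pseudo_y = standard_model(unlabeled_x).argmax(dim=1)

    # Step 3: robust training on the labeled + pseudo-labeled data combined.
    combined = ConcatDataset([labeled_set, TensorDataset(unlabeled_x, pseudo_y)])
    loader = DataLoader(combined, batch_size=128, shuffle=True)
    robust_model.train()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = attack(robust_model, x, y)          # inner maximization
            optimizer.zero_grad()
            F.cross_entropy(robust_model(x_adv), y).backward()
            optimizer.step()
    return robust_model
```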

In the staircase problem, RST mostly eliminates the standard accuracy tradeoff, achieving standard test error comparable to standard training while maintaining robustness (Figure~\ref{fig:tradeoff}c). The paper also points to external 2019 results showing that RST applied to CIFAR-10 with additional unlabeled data significantly improves both standard and robust accuracy compared to traditional adversarial training (Table~\ref{tab:rst-cifar}), further supporting the idea that unlabeled data can help bridge the gap.

The paper also contrasts the "robustness hurts" scenario with cases where adversarial training can "help" generalization. If the optimal predictor is simple (e.g., a line with slope $m = 0$ in the staircase example, Figure~\ref{fig:schematicb}) and there is noise in the data, robust training can act as a regularizer. By enforcing invariance over $B(x)$, robust training makes the model less sensitive to noise fluctuations within the invariance set, which can lead to lower standard test error, especially with limited data (Figure~\ref{fig:tradeon}a). Experiments on MNIST show a similar trend, where adversarial training yields lower standard test error than standard training, and this gap is largest with few samples (Figure~\ref{fig:tradeon}b). The authors attribute their positive MNIST results to better optimization compared to prior work.

Implementation Considerations:

  • Computational Cost: Adversarial training is significantly more computationally expensive than standard training because it involves an inner optimization loop (finding the worst-case perturbation) for each gradient step. RST adds further complexity by requiring training a standard model and then robustly training on a larger, augmented dataset. This means deploying robust models trained via adversarial training or RST requires substantially more computational resources during training.
  • Data Requirements: The paper strongly suggests that adversarial training requires more data than standard training to achieve comparable standard accuracy. Practitioners implementing robust models should be prepared for higher data demands or consider methods like RST if unlabeled data is available.
  • Robust Self-Training: Implementing RST requires access to a large pool of unlabeled data from the target distribution. The quality of pseudo-labels generated by the standard model is crucial; poor pseudo-labels could hinder the training process. Careful hyperparameter tuning for both the standard model and the subsequent robust training on the augmented data is necessary.
  • Choice of Attack: The specific adversarial attack used for training (e.g., PGD, FGSM, different $\ell_p$ norms) and its parameters ($\epsilon$, step size, number of steps) significantly impact the resulting model's robustness and potentially its standard accuracy tradeoff; a minimal PGD sketch follows this list.
  • Model Architecture: While the paper uses simple models (B-splines, small CNN, Wide ResNet), the findings suggest the generalization issue is general. The choice of architecture and its capacity relative to the data size will still influence the observed tradeoff.
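
For reference on the attack choice, a standard $\ell_\infty$ PGD attack and one adversarial-training step look roughly as follows; the defaults ($\epsilon = 8/255$, 10 steps) are common settings, not necessarily the configuration used in the paper's experiments.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: project onto the eps-ball around x after each ascent step."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One robust-training update: train on the worst-case perturbed inputs."""
    x_adv = pgd_linf(model, x, y)                  # inner maximization
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)        # outer minimization
    loss.backward()
    optimizer.step()
    return loss.item()
```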

In summary, the paper provides strong evidence that the standard accuracy drop in adversarial training is often a finite-sample generalization problem, not an inherent conflict of objectives when the optimal function is robust. It highlights that robust models can be more complex (in terms of functional form) and thus require more data to generalize well. Promisingly, it shows that leveraging unlabeled data through techniques like robust self-training can be an effective strategy to mitigate this tradeoff and improve both standard and robust performance.