
Theoretically Principled Trade-off between Robustness and Accuracy (1901.08573v3)

Published 24 Jan 2019 in cs.LG and stat.ML

Abstract: We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by $11.41\%$ in terms of mean $\ell_2$ perturbation distance.

Citations (2,332)

Summary

  • The paper proposes a new theoretical framework that decomposes robust error into natural and boundary errors to clarify the trade-off between accuracy and robustness.
  • It introduces the TRADES algorithm which minimizes a regularized surrogate loss to enhance both natural accuracy and adversarial robustness.
  • Experimental results confirm that adjusting the regularization parameter effectively controls the balance between standard performance and resistance to adversarial attacks.

This paper, "Theoretically Principled Trade-off between Robustness and Accuracy" (1901.08573), investigates the fundamental tension between achieving high natural accuracy on standard examples and high robust accuracy on adversarial examples. The authors propose a theoretical framework to understand this trade-off and derive a new adversarial training algorithm called TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization).

The core theoretical contribution lies in decomposing the robust error $R_{rob}(f)$, the probability that a classifier $f$ misclassifies at least one example within an $\epsilon$-ball around a natural example, into two components: the natural classification error $R_{nat}(f)$ and a term called the boundary error $R_{bdy}(f)$. The boundary error measures the probability that a natural example is correctly classified but lies within an $\epsilon$-neighborhood of the classifier's decision boundary, making it susceptible to small perturbations. Formally, the decomposition is $R_{rob}(f) = R_{nat}(f) + R_{bdy}(f)$.
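
Spelled out in the paper's notation (a reconstruction for readability, with labels $Y \in \{-1, +1\}$, $B(X, \epsilon)$ the $\epsilon$-ball around $X$, and $B(DB(f), \epsilon)$ the set of points within distance $\epsilon$ of the decision boundary $DB(f) = \{x : f(x) = 0\}$):

$$R_{rob}(f) = \Pr\big[\exists\, X' \in B(X, \epsilon) : f(X')Y \le 0\big], \qquad R_{nat}(f) = \Pr\big[f(X)Y \le 0\big], \qquad R_{bdy}(f) = \Pr\big[X \in B(DB(f), \epsilon),\ f(X)Y > 0\big]$$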

The paper then utilizes the theory of classification-calibrated surrogate losses to derive a differentiable upper bound on the robust error gap $R_{rob}(f) - R_{nat}^*$, where $R_{nat}^*$ is the optimal natural error. Theorem 1 states that for a classification-calibrated loss $\phi$ and its associated $\psi$-transform, this gap is bounded by a term related to the excess surrogate risk, $\psi^{-1}(R_\phi(f) - R_\phi^*)$, plus the boundary error term $\Pr[X \in B(DB(f), \epsilon),\ f(X)Y > 0]$. This boundary error term is further bounded by $\mathbb{E} \max_{X' \in B(X, \epsilon)} \phi(f(X')f(X)/\lambda)$. The intuition is that a large boundary error, i.e., a high probability of natural data lying close to the decision boundary, leads to high robust error. Minimizing the robust error requires reducing both natural error and boundary error. Theorem 2 provides a lower bound, suggesting the upper bound is tight in a theoretical sense.
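
Chaining the two bounds gives a compact form (a reconstruction from the statements above; see the paper for the precise conditions on $\phi$ and the $\psi$-transform):

$$R_{rob}(f) - R_{nat}^* \;\le\; \psi^{-1}\big(R_\phi(f) - R_\phi^*\big) + \Pr\big[X \in B(DB(f), \epsilon),\ f(X)Y > 0\big] \;\le\; \psi^{-1}\big(R_\phi(f) - R_\phi^*\big) + \mathbb{E} \max_{X' \in B(X, \epsilon)} \phi\big(f(X')f(X)/\lambda\big)$$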

Inspired by this theoretical analysis, the authors propose the TRADES algorithm which minimizes a regularized surrogate loss function:

$$\min_f \mathbb{E}\left\{\phi\big(f(X)Y\big) + \max_{X' \in B(X, \epsilon)} \phi\big(f(X)f(X')/\lambda\big)\right\}$$

This objective directly reflects the trade-off:

  1. The first term, $\phi(f(X)Y)$, is the standard empirical risk, encouraging the model to achieve high natural accuracy.
  2. The second term, $\max_{X' \in B(X, \epsilon)} \phi(f(X)f(X')/\lambda)$, acts as a regularizer. It encourages the model output for a natural example, $f(X)$, to be similar to the output for its adversarial counterpart, $f(X')$, where $X'$ is the perturbed point within the $\epsilon$-ball $B(X, \epsilon)$ that maximizes the chosen surrogate loss $\phi$ applied to the product $f(X)f(X')$. This term effectively pushes the decision boundary away from the data points, thereby improving robustness. The regularization parameter $\lambda$ controls the balance between minimizing natural error and minimizing boundary error: a smaller $\lambda$ (larger $1/\lambda$) puts more weight on the regularization term, favoring robustness over natural accuracy, and vice versa, which is empirically confirmed in the experiments. A minimal sketch of this two-term loss follows the list.
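
To make the structure concrete, here is a minimal PyTorch sketch of the two-term objective for a binary classifier, assuming the logistic surrogate $\phi(t) = \log(1 + e^{-t})$ and an adversarial score `f_x_adv` already produced by an inner maximization; the function name and interface are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def trades_binary_loss(f_x, f_x_adv, y, lam=1.0):
    """Two-term TRADES-style objective for a binary classifier.

    f_x     : scores f(X) on natural inputs, shape (batch,)
    f_x_adv : scores f(X') on adversarial inputs, shape (batch,)
    y       : labels in {-1, +1} (float), shape (batch,)
    lam     : regularization parameter lambda; a smaller lam weights
              the boundary term more heavily, favoring robustness.
    """
    # Logistic surrogate phi(t) = log(1 + exp(-t)) = softplus(-t),
    # which is classification-calibrated.
    natural_term = F.softplus(-f_x * y).mean()               # phi(f(X) Y)
    boundary_term = F.softplus(-f_x * f_x_adv / lam).mean()  # phi(f(X) f(X') / lam)
    return natural_term + boundary_term
```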

The paper contrasts TRADES with existing adversarial defense methods:

  • Robust Optimization [madry2018towards]: These methods typically minimize $\mathbb{E}[\max_{X' \in B(X, \epsilon)} \phi(f(X')Y)]$. While effective, the paper argues this objective might not be a tight upper bound on robust error and doesn't explicitly model the accuracy-robustness trade-off as clearly as TRADES's decomposition.
  • Other Regularization Methods [kurakin2016adversarial, ross2017improving, zheng2016improving]: These methods often use regularization terms based on the difference between $f(X')$ and $Y$, or on the gradient of $f(X')$, lacking the theoretical backing provided by TRADES's boundary error analysis. TRADES's regularization term measures the difference between $f(X)$ and $f(X')$, which directly relates to smoothness and the distance to the decision boundary.
  • Adversarial Logit Pairing (ALP) [kannan2018adversarial]: ALP minimizes a weighted sum of $\phi(f(X')Y)$, $\phi(f(X)Y)$, and $\|f(X) - f(X')\|_2$. TRADES differs by using a classification-calibrated loss for the difference between $f(X)$ and $f(X')$ and by formulating the inner maximization differently, based on the theoretical bounds. The three families of objectives are displayed side by side below.
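
Schematically (a paraphrase of the comparisons above; the ALP weights $\alpha, \beta, \gamma$ are illustrative placeholders, and in ALP $X'$ is a PGD adversarial example rather than the argument of an explicit inner maximization in the objective):

$$\text{Robust optimization:}\quad \min_f \mathbb{E}\Big[\max_{X' \in B(X, \epsilon)} \phi\big(f(X')Y\big)\Big]$$

$$\text{ALP:}\quad \min_f \mathbb{E}\Big[\alpha\,\phi\big(f(X')Y\big) + \beta\,\phi\big(f(X)Y\big) + \gamma\,\|f(X) - f(X')\|_2\Big]$$

$$\text{TRADES:}\quad \min_f \mathbb{E}\Big[\phi\big(f(X)Y\big) + \max_{X' \in B(X, \epsilon)} \phi\big(f(X)f(X')/\lambda\big)\Big]$$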

For practical implementation, especially in multi-class settings, TRADES extends the objective using a multi-class calibrated loss like cross-entropy:

$$\min_f \mathbb{E}\left\{L\big(f(X), Y\big) + \max_{X' \in B(X, \epsilon)} L\big(f(X), f(X')\big)/\lambda\right\}$$

where $L(\cdot, \cdot)$ is the multi-class loss. The optimization is typically performed using alternating gradient descent/ascent. The inner maximization problem (finding the adversarial example $X'$) is approximately solved using Projected Gradient Descent (PGD) with respect to the input, maximizing the term $L(f_\theta(x_i), f_\theta(x_i'))$. Since $x_i' = x_i$ is a minimizer of this inner objective (the loss of $f_\theta(x_i)$ against itself is zero, so the gradient vanishes there), the search is initialized with a small random perturbation around $x_i$ (Algorithm 1, Step 5) to obtain a useful starting point. The outer minimization (updating the network parameters $\theta$) is done via gradient descent on the full objective. The paper mentions acceleration techniques [shafahi2019adversarial, zhang2019you] that can make TRADES training more efficient. A sketch of this alternating procedure appears below.
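
Here is a simplified PyTorch sketch of one such alternating step, written under the assumption (matching the released TRADES implementation) that $L(f(X), f(X'))$ is instantiated as a KL divergence between softmax outputs; the hyperparameter defaults loosely follow the paper's CIFAR10 settings and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def trades_step(model, x, y, optimizer, eps=0.031, step_size=0.007,
                num_steps=10, inv_lam=6.0):
    """One TRADES training step: approximate the inner maximization
    with PGD on the input, then take an outer gradient step on the
    two-term objective."""
    model.eval()
    x_nat = x.detach()
    p_nat = F.softmax(model(x_nat), dim=1).detach()

    # Algorithm 1, Step 5: start from a small random perturbation,
    # since x' = x minimizes the inner objective (the KL term is zero
    # there, so its gradient vanishes).
    x_adv = x_nat + 0.001 * torch.randn_like(x_nat)
    for _ in range(num_steps):
        x_adv.requires_grad_()
        kl = F.kl_div(F.log_softmax(model(x_adv), dim=1), p_nat,
                      reduction='sum')
        grad = torch.autograd.grad(kl, x_adv)[0]
        # PGD ascent step, then projection onto the l_inf eps-ball.
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x_nat - eps), x_nat + eps)
        x_adv = x_adv.clamp(0.0, 1.0)

    # Outer minimization: gradient descent on the full objective.
    model.train()
    optimizer.zero_grad()
    logits = model(x_nat)
    natural_loss = F.cross_entropy(logits, y)            # L(f(X), Y)
    robust_reg = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                          F.softmax(logits, dim=1),
                          reduction='batchmean')         # L(f(X), f(X'))
    loss = natural_loss + inv_lam * robust_reg           # weight 1/lambda
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the inner loop perturbs only the input while the parameters are frozen, and only the outer step updates $\theta$, mirroring the alternating scheme described above.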

Experimental results demonstrate the effectiveness of TRADES:

  • Theoretical Tightness: On a binary MNIST task, the empirical gap between the left-hand side, $R_{rob}(f) - R_{nat}^*$, and the right-hand side, $\psi^{-1}(R_\phi(f) - R_\phi^*) + \mathbb{E}\max_{X' \in B(X, \epsilon)} \phi(f(X')f(X)/\lambda)$, of Theorem 1 is shown to be small across various $\lambda$ values (Table 3).
  • Sensitivity to $\lambda$: Experiments on MNIST and CIFAR10 show that increasing $1/\lambda$ (placing more emphasis on the regularization term) leads to increased robust accuracy and decreased natural accuracy, empirically verifying the predicted trade-off (Table 4).
  • White-box Attacks: TRADES achieves state-of-the-art robust accuracy on CIFAR10 ($56.61\%$ robust accuracy under an $\ell_\infty = 0.031$ PGD attack with 20 iterations, using WRN-34-10), significantly outperforming prior methods, including other regularization techniques and robust optimization (Table 5). It also performs well against various other white-box attacks such as DeepFool, LBFGSAttack, MI-FGSM, and C&W. On MNIST ($\ell_\infty = 0.3$), TRADES achieves robust accuracy comparable to or slightly better than Madry et al. [madry2018towards] (Table 5).
  • Black-box Attacks: TRADES-trained models show higher robust accuracy against black-box attacks (adversarial examples transferred from naturally trained models or from Madry et al.'s robust models) than Madry et al.'s models, and adversarial examples generated from TRADES models are more effective at attacking other models (Tables 6, 7, 8, 9). This suggests better generalization of robustness.
  • NeurIPS 2018 Adversarial Vision Challenge: The TRADES methodology was the foundation for the winning entry in the black-box Tiny ImageNet competition, achieving the highest mean $\ell_2$ perturbation distance by a significant margin (Figure 4).
  • Interpretability: A variant of TRADES applied to the bird-or-bicycle dataset shows that adversarial examples generated by a boundary attack exhibit features of both classes (Figures 5, 6, 7, 8), suggesting that the robust model learns decision boundaries that separate classes based on more discriminative features, leading to improved interpretability compared to non-robust models.

In summary, the paper provides a theoretically grounded framework for understanding the accuracy-robustness trade-off through the decomposition of robust error. This framework leads to the TRADES algorithm, a novel adversarial training method that explicitly optimizes this trade-off via a regularized surrogate loss. TRADES demonstrates strong empirical performance against various white-box and black-box attacks across different datasets and achieved top performance in a major adversarial competition, highlighting its practical effectiveness in training robust neural networks.
