SWM-AED: Sliding Mask Adversarial Detection
- The paper introduces a sliding window mask mechanism that computes Sliding Mask Confidence Entropy (SMCE) to capture prediction fluctuations between clean and adversarial images.
- It employs a model-agnostic, retraining-free approach where localized occlusions reveal significant confidence variability in adversarial attacks.
- Empirical evaluations on CIFAR-10 show detection accuracies from 62.5% to 96%, outperforming several adversarial training methods and demonstrating robust generalizability.
Sliding Window Mask-based Adversarial Example Detection (SWM-AED) is a detection framework for adversarial examples that operates by quantifying the sensitivity of deep neural network (DNN) predictions to local occlusions via a sliding window mask. Motivated by empirical findings that adversarial images exhibit significantly higher classifier confidence fluctuation under localized masking compared to clean images, SWM-AED exploits this property to distinguish adversarial attacks across a wide array of image-based perturbation methods. The method introduces the concept of Sliding Mask Confidence Entropy (SMCE) as a detection statistic, offering a model-agnostic and retraining-free approach robust to many canonical attacks.
1. Sliding Window Mask Mechanism
SWM-AED applies a dark occlusion patch ("mask") of fixed size $m \times m$ to an input image $x$. The mask is systematically "slid" across all spatial locations where it fully fits within the image, using a stride $s$ (typically $s = 1$). At each position $(i, j)$, an occluded image is generated by setting the masked region's pixels to zero (or a designated fixed value).
Each occluded image is passed through a pretrained classifier $f$, producing a softmax confidence vector over $K$ classes. The method rests on the observation that clean images tend to be robust to such localized occlusion (the classifier's top-1 confidence remains stable), while adversarial examples, which typically lie near decision boundaries, react with pronounced fluctuations or confidence collapse for certain mask placements.
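The sliding occlusion step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names `sliding_mask_positions` and `occlude`, the 32×32 RGB input, and the mask size of 8 are all assumptions made for the example.

```python
import numpy as np

def sliding_mask_positions(height, width, m, s=1):
    """Top-left corners (i, j) at which an m x m mask fits fully inside the image."""
    return [(i, j)
            for i in range(0, height - m + 1, s)
            for j in range(0, width - m + 1, s)]

def occlude(x, i, j, m, fill=0.0):
    """Return a copy of x (H x W or H x W x C) with the m x m patch at (i, j) set to `fill`."""
    out = x.copy()
    out[i:i + m, j:j + m, ...] = fill
    return out

# Hypothetical 32 x 32 RGB image with values in [0, 1]; mask size 8 is an assumed value.
x = np.random.default_rng(0).random((32, 32, 3))
positions = sliding_mask_positions(32, 32, m=8, s=1)
occluded = occlude(x, *positions[0], m=8)
print(len(positions))  # 625 mask placements, i.e. 625 classifier calls per image
```

With stride 1 the number of placements grows as $(H - m + 1)(W - m + 1)$, so every screened image costs that many forward passes through the classifier.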
2. Sliding Mask Confidence Entropy (SMCE)
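The SMCE provides a scalar measure of the model's prediction volatility under sliding mask perturbations. Formally, for $N$ sliding positions, SMCE is defined as:

$$\mathrm{SMCE}(x) = \frac{1}{N} \sum_{n=1}^{N} H(p_n), \qquad H(p_n) = -\sum_{k=1}^{K} p_{n,k} \log_2 p_{n,k}$$

where $p_n$ is the softmax confidence vector over $K$ classes obtained with the mask at position $n$.
For each mask position $n$:
- Compute the post-mask confidence vector $p_n$
- Compute its entropy $H(p_n)$
- Average all $H(p_n)$ to get an overall entropy score
Key properties:
- $0 \le \mathrm{SMCE}(x) \le \log_2 K$ (upper bound achieved under uniform confidence)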
SMCE reflects the degree of confidence destabilization under occlusion. Clean samples yield low entropy, while adversarial samples—especially those near classification boundaries—result in much higher SMCE due to large softmax perturbations.
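The averaging of per-position entropies described above can be sketched as follows. The helper names `entropy_bits` and `smce`, the small clipping constant, and the three-class confidence vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def entropy_bits(p, eps=1e-12):
    """Shannon entropy (base 2) of one confidence vector; eps avoids log2(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return float(-np.sum(p * np.log2(p)))

def smce(confidences):
    """Mean entropy over the confidence vectors from all mask positions."""
    return float(np.mean([entropy_bits(p) for p in confidences]))

# Made-up confidence vectors for two mask positions each.
stable = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]    # confidence barely moves
unstable = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25]]  # confidence collapses once
uniform = [[1 / 3, 1 / 3, 1 / 3]]                    # attains the upper bound log2(3)

print(smce(stable) < smce(unstable))  # True
```

A single mask placement that flattens the softmax output is enough to raise the average noticeably, which is the behavior the detector relies on.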
3. SWM-AED Algorithm
The SWM-AED detection decision is based on whether an image's SMCE exceeds a threshold $\tau$. The process is summarized below:
    Input: image x, classifier f, mask size m, stride s, threshold τ
    Output: is_adversarial ∈ {True, False}

    1. H ← 0 ; positions ← all (i,j) where mask fits in x with stride s
    2. For each (i,j) in positions:
           M ← zero mask of size m×m placed at (i,j)
           p ← f(x ⊙ M)                     # softmax confidence over classes
           h ← − sum_k p[k] * log2(p[k])
           H ← H + h
    3. H ← H / |positions|                  # average entropy
    4. Return (H > τ)
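Typical parameter choices for CIFAR-10 are a mask size $m$ of roughly one quarter of the image width, stride $s = 1$, and a threshold $\tau$ near 0.1. If SMCE exceeds $\tau$, the sample is flagged as adversarial. No model retraining is involved; the base classifier is used as-is.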
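A runnable sketch of the detection procedure is given below, under stated assumptions. The function `swm_aed` mirrors the pseudocode, but `toy_classifier` is a hypothetical random linear model standing in for a real pretrained network, and the values `m=8`, `s=4`, and `tau=0.1` are chosen only to keep the example small.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def swm_aed(x, f, m, s, tau, eps=1e-12):
    """Return (is_adversarial, smce) for image x under classifier f.

    f maps an image array to a softmax confidence vector.
    """
    h, w = x.shape[:2]
    entropies = []
    for i in range(0, h - m + 1, s):
        for j in range(0, w - m + 1, s):
            occluded = x.copy()
            occluded[i:i + m, j:j + m, ...] = 0.0    # zero out the masked patch
            p = np.clip(f(occluded), eps, 1.0)       # avoid log2(0)
            entropies.append(-np.sum(p * np.log2(p)))
    score = float(np.mean(entropies))                # average entropy (SMCE)
    return score > tau, score

# Hypothetical stand-in classifier: a random linear layer followed by softmax.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32 * 32 * 3))

def toy_classifier(img):
    return softmax(W @ img.ravel())

x = rng.random((32, 32, 3))
flag, score = swm_aed(x, toy_classifier, m=8, s=4, tau=0.1)
print(flag, round(score, 4))
```

In practice `f` would wrap a batched forward pass of the deployed model, since evaluating each occluded image separately is the dominant cost.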
4. Theoretical Underpinnings
The rationale for SWM-AED is rooted in the geometric properties of adversarial examples in deep feature space:
- Adversarial images are constructed to reside close to a decision boundary. Minor perturbations, such as those introduced by small occlusions, are disproportionately likely to alter the classifier’s output or diffuse the softmax distribution, resulting in increased entropy.
- Clean images are typically embedded within "confidence basins" where local occlusions scarcely impact the classifier's decision. This translates into lower SMCE.
- SMCE thus operationalizes the difference in local smoothness between clean and adversarial regions, serving as a metric for localized vulnerability.
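The boundary argument can be illustrated with a toy two-class model. This is an assumption-laden sketch rather than the paper's analysis: occlusions are modeled as small shifts of a single logit margin, and the margins 6.0 ("basin") and 0.5 ("near boundary") are invented for the illustration.

```python
import numpy as np

def binary_entropy_bits(margin):
    """Entropy (bits) of a two-class softmax whose logit margin is `margin`."""
    p = 1.0 / (1.0 + np.exp(-margin))
    q = 1.0 - p
    return -(p * np.log2(p) + q * np.log2(q))

# Assumed effect of occlusion: the logit margin shifts by up to +/- 1.
shifts = np.linspace(-1.0, 1.0, 21)

basin = float(np.mean(binary_entropy_bits(6.0 + shifts)))     # far from the boundary
boundary = float(np.mean(binary_entropy_bits(0.5 + shifts)))  # close to the boundary

print(basin < boundary)  # True
```

The same perturbation size produces a far larger average entropy when the starting point sits near the decision boundary, which is the gap that a threshold on SMCE is meant to separate.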
5. Experimental Validation
Evaluations were conducted on CIFAR-10, employing 1,800 clean and 1,800 adversarial images (200 per attack). Adversarial attacks included FGSM, PGD, BIM, JSMA, DeepFool, FFGSM, APGD, OnePixel, and PIFGSM-PP. Tested classifiers included ResNet-18 (≈96% accuracy), ResNet-50 (≈79%), and VGG-11 (≈81%). Three mask sizes were evaluated, each with stride 1. Thresholds $\tau$ were swept over [0.05, 0.2] to identify a robust setting.
Detection accuracy by attack (ResNet-18, fixed mask size and threshold):

| Attack    | Accuracy (%) |
|-----------|--------------|
| JSMA      | 96.0         |
| DeepFool  | 90.0         |
| FGSM      | 85.5         |
| PGD       | 76.0         |
| BIM       | 62.5         |
| FFGSM     | 82.0         |
| APGD      | 75.0         |
| OnePixel  | 73.5         |
| Pixle     | 89.0         |
| PIFGSM-PP | 79.0         |
Average detection was ~80%. The method demonstrated robustness across attacks, with a worst-case detection rate of 62.5% (BIM) and a best-case rate of 96.0% (JSMA). Smaller masks yielded slightly higher average performance, though the mask size reported as optimal for CIFAR-10 images was not the smallest tested. Higher-capacity classifiers had greater SMCE separability between clean and adversarial inputs, enhancing detection.
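The summary figures can be checked directly against the per-attack table. The snippet below only restates the tabulated accuracies and computes their mean, minimum, and maximum; the variable names are illustrative.

```python
# Per-attack detection accuracy (%) as listed in the table above.
accuracy = {
    "JSMA": 96.0, "DeepFool": 90.0, "FGSM": 85.5, "PGD": 76.0, "BIM": 62.5,
    "FFGSM": 82.0, "APGD": 75.0, "OnePixel": 73.5, "Pixle": 89.0, "PIFGSM-PP": 79.0,
}

mean_acc = sum(accuracy.values()) / len(accuracy)
worst = min(accuracy, key=accuracy.get)
best = max(accuracy, key=accuracy.get)

print(f"{mean_acc:.2f}", worst, best)  # 80.85 BIM JSMA
```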
Compared to adversarial training methods (PGD-AT, Fast-AT, Free-AT), which defend against roughly 47% of PGD examples, SWM-AED detected roughly 76% of PGD examples.
6. Comparison with Other Defenses and Deployment Practices
Unlike adversarial training and other defenses, SWM-AED:
- Does not require retraining or network modification, precluding risks of catastrophic overfitting and obviating additional training-time computational load.
- Generalizes across attack types due to reliance on the intrinsic property of occlusion sensitivity, rather than attack-specific signatures or gradient masking.
- Operates in a model-agnostic fashion, functioning with any pretrained classifier ("plug-and-play") and requiring no extra detection head or reconstruction network. Detection performance benefits naturally from improvements in base model accuracy.
Recommended deployment guidelines:
- Use a deep, high-accuracy model (e.g., ResNet-18 or deeper) for SMCE computation.
- Select a mask size approximately one quarter of the image width (for 32×32 CIFAR-10 images, about 8 pixels).
- Employ stride 1 for comprehensive spatial sampling.
- Calibrate the threshold $\tau$ on a mixed hold-out set of clean and adversarial data, beginning at ~0.1.
- Periodically monitor SMCE value distributions to detect overlap drift; adjust model capacity or mask size as necessary.
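The threshold calibration step in the guidelines above might look like the following sketch. The function `calibrate_threshold` is hypothetical, and the SMCE scores are synthetic draws from gamma distributions chosen only so the example runs; real scores would come from a hold-out set scored by the deployed classifier.

```python
import numpy as np

def calibrate_threshold(clean_scores, adv_scores, candidates):
    """Pick the candidate tau with the highest detection accuracy on held-out scores.

    A sample is flagged as adversarial when its SMCE is strictly greater than tau.
    """
    clean = np.asarray(clean_scores, dtype=float)
    adv = np.asarray(adv_scores, dtype=float)
    best_tau, best_acc = None, -1.0
    for tau in candidates:
        correct = np.sum(clean <= tau) + np.sum(adv > tau)
        acc = correct / (clean.size + adv.size)
        if acc > best_acc:                      # ties keep the smaller tau
            best_tau, best_acc = float(tau), float(acc)
    return best_tau, best_acc

# Synthetic stand-in SMCE scores (assumed distributions, not measured data).
rng = np.random.default_rng(0)
clean_scores = rng.gamma(shape=2.0, scale=0.02, size=500)
adv_scores = rng.gamma(shape=4.0, scale=0.08, size=500)

# Sweep the same range reported for the experiments, in steps of 0.01.
tau, acc = calibrate_threshold(clean_scores, adv_scores, np.linspace(0.05, 0.2, 16))
print(tau, acc)
```

Re-running the same sweep on fresh hold-out data at intervals is one simple way to notice when the clean and adversarial SMCE distributions start to overlap.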
7. Limitations, Implications, and Outlook
SWM-AED achieves strong empirical detection (62.5% to 96%) across diverse attacks without model retraining or architectural changes. Its effectiveness arises from quantifying susceptibility to local occlusion, a property that distinguishes clean from adversarial samples for modern DNN classifiers. Potential limitations include diminished separability if future attacks explicitly optimize for SMCE invariance, or if the base classifier has low accuracy, which compresses the SMCE gap between clean and adversarial distributions.
A plausible implication is that continual increases in model capacity and accuracy will enhance SMCE separability, consolidating the method’s utility as a lightweight, scalable adversarial screening layer for deployed image classifiers.