WAVS: Adversarial Vulnerability Metric
- WAVS is a formal metric that quantifies machine learning model susceptibility to adversarial attacks using normalized and weighted scores.
- It integrates per-instance vulnerability measures through normalization and convex aggregation to capture both individual weaknesses and systemic threats.
- Its instantiations in vision, language, and multimodal domains provide actionable insights for defense benchmarking and red teaming.
The Weighted Adversarial Vulnerability Score (WAVS) is a formal metric family that quantifies the susceptibility of machine learning systems—including deep neural networks, vision-LLMs, and LLMs—to adversarial attacks. By aggregating per-instance vulnerability assessments with explicit normalization and domain-relevant weighting, WAVS enables robust, interpretable evaluation of both individual model weaknesses and systemic threats arising from diverse adversarial strategies. The metric has emerged in multiple research areas with instantiations tailored for vision, language, and multimodal models, providing a unified scoring paradigm for both defense benchmarking and red teaming.
1. Core Mathematical Framework
WAVS centers on the aggregation of per-instance or per-scenario adversarial vulnerability scores into a scalar that summarizes robustness. While specific forms vary, the essential components are:
- Instance-level vulnerability: A function capturing how susceptible a given input (or model decision) is to adversarial perturbation, based on margin, loss, or attack-induced score change.
- Normalization: Transformation of component scores to common scales, typically via min–max normalization or division by a base value.
- Weighting and aggregation: Convex combination or arithmetic-mean pooling of vulnerability components, yielding a single dataset-level or scenario-level measure (a minimal sketch follows this list).
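As a generic illustration of this pipeline, the following sketch aggregates per-instance vulnerability components into a scalar score; min–max normalization and fixed convex weights are assumptions here, since both vary across WAVS instantiations:

```python
import numpy as np

def min_max_normalize(x) -> np.ndarray:
    """Rescale a vector of raw component scores to [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

def wavs(components, weights) -> float:
    """Convex aggregation of normalized per-instance vulnerability scores.

    components: one array of raw scores per vulnerability dimension.
    weights:    nonnegative weights summing to 1 (convex combination).
    """
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    per_instance = sum(w * min_max_normalize(c)
                       for w, c in zip(weights, components))
    return float(per_instance.mean())   # dataset-level scalar

# Hypothetical example: two vulnerability dimensions over five inputs.
margin_based = [0.9, 0.1, 0.5, 0.7, 0.2]
loss_based   = [1.2, 3.4, 2.0, 0.8, 2.9]
print(wavs([margin_based, loss_based], [0.6, 0.4]))
```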
For example, in adversarially trained vision models, per-example WAVS is defined as the rescaled perturbation radius
$$\epsilon_i = \epsilon_0 \bigl(1 + \alpha\, v(x_i)\bigr),$$
where $v(x_i)$ is a margin- or deviation-based vulnerability metric, $\epsilon_0$ is the fixed base radius, and $\alpha$ modulates the scaling. For dataset-level evaluation:
$$\mathrm{WAVS} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i.$$
This directly reflects the mean perturbation budget encountered in adversarial training (Fakorede et al., 6 Mar 2024).
In evaluation of LLM-based scientific assessment, WAVS takes the form:
$$\mathrm{WAVS}(p, a) = w_1\, \mathcal{N}(\Delta s_{p,a}) + w_2\, \mathcal{N}(F_{p,a}) + w_3\, \mathcal{N}(R_{p,a}),$$
with
$$w_1 + w_2 + w_3 = 1, \qquad w_k \ge 0,$$
where $\mathcal{N}(\cdot)$ denotes normalization, $\Delta s_{p,a}$ is the attack-induced score inflation, $F_{p,a}$ the semantic-flip severity, and $R_{p,a}$ the risk weight for paper $p$ under attack $a$ (Sahoo et al., 11 Dec 2025).
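As a worked illustration with hypothetical values (the weights and component scores below are assumptions, not the paper's): suppose the normalized components are $\mathcal{N}(\Delta s) = 0.5$, $\mathcal{N}(F) = 1.0$, $\mathcal{N}(R) = 0.6$ with weights $(w_1, w_2, w_3) = (0.3, 0.4, 0.3)$. Then
$$\mathrm{WAVS} = 0.3 \times 0.5 + 0.4 \times 1.0 + 0.3 \times 0.6 = 0.73,$$
a high vulnerability score driven chiefly by the full semantic flip.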
2. Notable Instantiations
WAVS is instantiated with domain-adapted vulnerability measures and aggregation procedures, the most prominent being:
| WAVS Variant | Domain | Core Vulnerability Measure(s) |
|---|---|---|
| MWPB/SDWPB (Fakorede et al., 6 Mar 2024) | Vision (DNNs) | Margin or logit stddev (per-sample) |
| Non-uniform Minimax (Zeng et al., 2020) | Vision (DNNs) | Exponential of negative adversarial margin |
| VLM Safety (Rashid et al., 22 Feb 2025) | Vision-Language | Accuracy drops under random noise/FGSM |
| Peer Review LLM (Sahoo et al., 11 Dec 2025) | LLMs (review systems) | Score inflation, flip severity, risk label |
In all settings, WAVS provides a mechanism for principled risk-sensitive robustness diagnostics, aligning with practical attack models and domain-specific threat priorities.
3. Methodologies for Computing and Aggregating WAVS
Vision models (AT, TRADES, MART): Per-sample vulnerability is determined via either the multi-class margin
$$m(x_i) = f_{y_i}(x_i) - \max_{j \neq y_i} f_j(x_i)$$
or the standard deviation of the logits about the correct-class logit
$$\sigma(x_i) = \sqrt{\frac{1}{K} \sum_{j=1}^{K} \bigl(f_j(x_i) - f_{y_i}(x_i)\bigr)^2},$$
with per-sample perturbation budgets
$$\epsilon_i = \epsilon_0 \bigl(1 + \alpha\, v(x_i)\bigr),$$
where $v(x_i)$ maps the margin or deviation to a vulnerability weight (smaller margins indicating higher vulnerability). Dataset-level WAVS is computed as the mean of $\epsilon_i$ over all $i$ (Fakorede et al., 6 Mar 2024).
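A minimal sketch of the margin-based budget computation under these definitions; the sigmoid mapping from margin to vulnerability weight is an assumption standing in for the paper's reweighting function:

```python
import numpy as np

def per_sample_budgets(logits, labels, eps0=8/255, alpha=0.5):
    """Margin-based per-sample budgets eps_i = eps0 * (1 + alpha * v_i).

    logits: (n, K) model logits; labels: (n,) correct class indices.
    Low-margin (more vulnerable) samples receive larger budgets.
    """
    logits = np.asarray(logits, dtype=float)
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    rival = logits.copy()
    rival[np.arange(n), labels] = -np.inf
    margin = correct - rival.max(axis=1)      # multi-class margin m(x_i)
    v = 1.0 / (1.0 + np.exp(margin))          # assumed map: sigmoid(-margin)
    return eps0 * (1.0 + alpha * v)

# Dataset-level WAVS is then the mean budget:
# wavs = per_sample_budgets(logits, labels).mean()
```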
Weighted Minimax (non-uniform attacks): The adversarial risk is weighted by
$$w_i \propto \exp\bigl(-c\, m_{\mathrm{adv}}(x_i)\bigr),$$
the exponential of the negative adversarial margin (with temperature $c > 0$), and can be normalized over test points:
$$p_i = \frac{\exp\bigl(-c\, m_{\mathrm{adv}}(x_i)\bigr)}{\sum_{j} \exp\bigl(-c\, m_{\mathrm{adv}}(x_j)\bigr)},$$
yielding a per-example selection probability interpreted as WAVS (Zeng et al., 2020).
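This normalization is a softmax over negative adversarial margins, which a short sketch makes explicit (the temperature and input margins are illustrative):

```python
import numpy as np

def selection_probabilities(adv_margins, c=1.0):
    """Softmax over negative adversarial margins: low-margin (vulnerable)
    test points receive high selection probability, interpreted as WAVS."""
    z = -c * np.asarray(adv_margins, dtype=float)
    z -= z.max()                         # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Example: the lowest-margin point dominates the attacker's selection.
print(selection_probabilities([2.0, 0.1, 1.5]))  # mass concentrates on index 1
```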
Vision-LLMs: WAVS is a convex combination of the relative drop in accuracy under composite random noise versus under FGSM:
$$\mathrm{WAVS} = \lambda\, \Delta_{\mathrm{rand}} + (1 - \lambda)\, \Delta_{\mathrm{FGSM}}, \qquad \lambda \in [0, 1],$$
with each score defined as a relative decrease from the clean baseline (Rashid et al., 22 Feb 2025).
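A minimal sketch of this combination, assuming the two drops are measured in percentage points against the clean baseline (a reading consistent with the CLIP/Caltech-256 numbers reported in Section 4; $\lambda = 0.5$ gives the "balanced" score):

```python
def vlm_wavs(acc_clean: float, acc_rand: float, acc_fgsm: float,
             lam: float = 0.5) -> float:
    """Convex combination of accuracy drops under random noise and FGSM."""
    drop_rand = acc_clean - acc_rand    # composite random-noise drop
    drop_fgsm = acc_clean - acc_fgsm    # gradient-based (FGSM) drop
    return lam * drop_rand + (1.0 - lam) * drop_fgsm

# Illustrative numbers close to the CLIP/Caltech-256 results below:
print(vlm_wavs(acc_clean=95.0, acc_rand=66.0, acc_fgsm=5.0))  # -> 59.5
```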
LLM-based review systems: WAVS for each (paper, attack) pair is the normalized weighted sum of score delta, semantic-flip severity, and risk alignment, aggregated by averaging over all instances (Sahoo et al., 11 Dec 2025).
4. Empirical Insights and Evaluations
Margin/Stddev-Weighted Adversarial Training (MWPB/SDWPB): Integrating WAVS-driven per-sample budgets in adversarial training led to consistent improvements in robust accuracy across CIFAR-10, SVHN, TinyImageNet, and both white-box and black-box threats (PGD-20, CW, AA, Square, SPSA). For example, on ResNet-18/CIFAR-10:
- Standard AT: PGD-20 accuracy 52.72%
- MWPB-AT: PGD-20 accuracy 56.25% (+3.53)
- SDWPB-AT: PGD-20 accuracy 56.69% (+3.97)
WAVS directly summarized the mean perturbation intensity encountered, providing both a training control signal and an evaluation statistic (Fakorede et al., 6 Mar 2024).
Non-uniform attack resistance: Weighted minimax (WAVS-aligned) adversarial training preserved or slightly improved uniform robust accuracy while raising adversarial accuracy under non-uniform attacks by several points. For CIFAR-10 (20-step PGD):
| Defense | Uniform attack | Non-uniform attack (1) | Non-uniform attack (2) |
|---|---|---|---|
| PGD (base) | 49.29% | 11.66% | 9.72% |
| PGD+WAVS | 49.53% | 13.19% | 10.81% |
By learning to emphasize highly vulnerable examples (larger $w_i$), the defense adapts to structured, threat-adaptive adversaries (Zeng et al., 2020).
Vision-language safety: On CLIP/Caltech-256, WAVS captured the stark contrast between natural and adversarial noise, quantifying both information-theoretic and gradient-based sensitivity in a single interpretable value. With baseline accuracy of 95%, composite random noise dropped accuracy by ≈29 points, FGSM by ≈90, yielding a balanced WAVS of ≈59.8 (for $\lambda = 0.5$) (Rashid et al., 22 Feb 2025).
LLM-based review vulnerability: WAVS separated superficial score-nudging from deep semantic flips and risk, revealing consistent differences between open-source and proprietary models and across attack strategies. Top strategies against open models achieved WAVS ≈0.81, while GPT-5 maintained WAVS near zero, indicating strong defensive alignment (Sahoo et al., 11 Dec 2025).
5. Theoretical Motivation and Interpretability
WAVS addresses several deficiencies in classical adversarial vulnerability metrics:
- Non-uniform adversarial focus: WAVS quantifies not just global robustness, but the distribution of vulnerability across instances/scenarios, reflecting adversary adaptivity.
- Integration of multiple risk dimensions: Especially in LLM settings, WAVS fuses score inflation, flip risk, and ground-truth severity, yielding actionable decompositions.
- Hyperparameterized focus: The weighting scheme allows tuning according to operational risk—natural perturbations versus targeted attacks (Rashid et al., 22 Feb 2025, Sahoo et al., 11 Dec 2025).
- Direct training and diagnostic utility: In AT, WAVS surfaces as both a loss-reweighting (during training) and an evaluative summary (after training) (Fakorede et al., 6 Mar 2024, Zeng et al., 2020).
6. Practical Considerations and Limitations
Assumptions and normalization: WAVS calculations assume access to per-instance logits, risk labels, or adversarial deltas, as appropriate. Normalization choices (e.g., min–max) can be sensitive to outliers and must be applied consistently across evaluations for fair comparison (Sahoo et al., 11 Dec 2025). In peer-review settings, mislabeling of paper risk can distort WAVS.
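A two-line illustration of that outlier sensitivity (values hypothetical): a single extreme score stretches the min–max range and collapses all other normalized scores toward zero.

```python
import numpy as np

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scores = [0.10, 0.12, 0.15, 0.11]
print(min_max(scores))          # [0.  0.4 1.  0.2] -- full spread over [0, 1]
print(min_max(scores + [5.0]))  # the same four scores collapse toward 0
```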
Domain-specific weighting: The parameters governing the mix of components (e.g., $w_1$, $w_2$, $w_3$ for LLMs; $\lambda$ for VLMs) are currently heuristic but allow for alignment with domain risk profiles. Sensitivity analyses indicate that stable comparative rankings are obtained if critical components (flip severity/risk) receive at least 60% cumulative weight (Sahoo et al., 11 Dec 2025).
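A hedged sketch of this kind of sensitivity check (the weight grid, threshold, and component layout are illustrative, not the paper's protocol):

```python
import itertools
import numpy as np

def rankings_stable(per_model_components, min_critical=0.6, step=0.1):
    """Check whether model rankings by WAVS are invariant over all convex
    weightings that give the flip/risk components >= min_critical in total.

    per_model_components: model name -> normalized [inflation, flip, risk].
    """
    rankings = set()
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w1, w2 in itertools.product(grid, grid):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9 or w2 + max(w3, 0.0) < min_critical:
            continue                      # outside the simplex or threshold
        scores = {m: w1 * c[0] + w2 * c[1] + w3 * c[2]
                  for m, c in per_model_components.items()}
        rankings.add(tuple(sorted(scores, key=scores.get, reverse=True)))
    return len(rankings) == 1

# Example: two hypothetical models with fixed normalized components.
models = {"open-A": np.array([0.7, 0.9, 0.8]),
          "prop-B": np.array([0.2, 0.1, 0.1])}
print(rankings_stable(models))  # True: open-A dominates at every weighting
```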
Threat space coverage: For an adversarially comprehensive WAVS, representative attack and perturbation strategies must be included. The metric does not account for "backfire" robustness (cases where attacks inadvertently increase accuracy), which is a limitation in its current forms (Sahoo et al., 11 Dec 2025).
7. Open Questions and Future Directions
Open questions regarding WAVS include:
- Optimal weighting for novel domains: The heuristic assignment of component weights invites further theoretical and empirical justification tailored to new application areas (e.g., grant review, medical triage).
- Extension to multimodal and interactive systems: Initial WAVS variants exist for images, text, and multimodal I/O, but rigorous frameworks for video, tabular, and interactive decision-making are a prospective research direction (Sahoo et al., 11 Dec 2025).
- Defensive strategies for WAVS minimization: Identifying adversarial training regimes, regularizers, or alignment objectives that directly minimize WAVS, particularly in the context of dynamic, adaptive attackers, remains an open engineering and theoretical problem.
WAVS provides a principled, adaptable, and interpretable lens on adversarial vulnerability, enabling rigorous cross-domain robustness evaluation and targeted defense development. Its practical deployment requires careful threat modeling, control of normalization and weighting, and continued calibration against evolving adversarial paradigms.