WAVS: Adversarial Vulnerability Metric
- WAVS is a formal metric that quantifies machine learning model susceptibility to adversarial attacks using normalized and weighted scores.
- It integrates per-instance vulnerability measures through normalization and convex aggregation to capture both individual weaknesses and systemic threats.
- Its instantiations in vision, language, and multimodal domains provide actionable insights for defense benchmarking and red teaming.
The Weighted Adversarial Vulnerability Score (WAVS) is a formal metric family that quantifies the susceptibility of machine learning systems—including deep neural networks, vision-LLMs, and LLMs—to adversarial attacks. By aggregating per-instance vulnerability assessments with explicit normalization and domain-relevant weighting, WAVS enables robust, interpretable evaluation of both individual model weaknesses and systemic threats arising from diverse adversarial strategies. The metric has emerged in multiple research areas with instantiations tailored for vision, language, and multimodal models, providing a unified scoring paradigm for both defense benchmarking and red teaming.
1. Core Mathematical Framework
WAVS centers on the aggregation of per-instance or per-scenario adversarial vulnerability scores into a scalar that summarizes robustness. While specific forms vary, the essential components are:
- Instance-level vulnerability: A function capturing how susceptible a given input (or model decision) is to adversarial perturbation, based on margin, loss, or attack-induced score change.
- Normalization: Transformation of component scores to common scales, typically via min–max normalization or division by a base value.
- Weighting and aggregation: Convex combination or arithmetic-mean pooling of vulnerability components, yielding a single dataset-level or scenario-level measure (a minimal sketch follows this list).
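As a generic illustration of this pipeline, the following sketch aggregates per-instance vulnerability components into a scalar score; min–max normalization and fixed convex weights are assumptions here, since both vary across WAVS instantiations:

```python
import numpy as np

def min_max_normalize(x) -> np.ndarray:
    """Rescale a vector of raw component scores to [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

def wavs(components, weights) -> float:
    """Convex aggregation of normalized per-instance vulnerability scores.

    components: one array of raw scores per vulnerability dimension.
    weights:    nonnegative weights summing to 1 (convex combination).
    """
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    per_instance = sum(w * min_max_normalize(c)
                       for w, c in zip(weights, components))
    return float(per_instance.mean())   # dataset-level scalar

# Hypothetical example: two vulnerability dimensions over five inputs.
margin_based = [0.9, 0.1, 0.5, 0.7, 0.2]
loss_based   = [1.2, 3.4, 2.0, 0.8, 2.9]
print(wavs([margin_based, loss_based], [0.6, 0.4]))
```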
For example, in adversarially trained vision models, per-example WAVS is defined as the rescaled perturbation radius
$$\epsilon_i = \epsilon_0 \bigl(1 + \alpha\, v(x_i)\bigr),$$
where $v(x_i)$ is a margin- or deviation-based vulnerability metric, $\epsilon_0$ is the fixed base radius, and $\alpha$ modulates the scaling. For dataset-level evaluation:
$$\mathrm{WAVS} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i.$$
This directly reflects the mean perturbation budget encountered in adversarial training (Fakorede et al., 6 Mar 2024).
In evaluation of LLM-based scientific assessment, WAVS takes the form:
$$\mathrm{WAVS}(p, a) = w_1\, \mathcal{N}(\Delta s_{p,a}) + w_2\, \mathcal{N}(F_{p,a}) + w_3\, \mathcal{N}(R_{p,a}),$$
with
$$w_1 + w_2 + w_3 = 1, \qquad w_k \ge 0,$$
where $\mathcal{N}(\cdot)$ denotes normalization, $\Delta s_{p,a}$ is the attack-induced score inflation, $F_{p,a}$ the semantic-flip severity, and $R_{p,a}$ the risk weight for paper $p$ under attack $a$ (Sahoo et al., 11 Dec 2025).
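As a worked illustration with hypothetical values (the weights and component scores below are assumptions, not the paper's): suppose the normalized components are $\mathcal{N}(\Delta s) = 0.5$, $\mathcal{N}(F) = 1.0$, $\mathcal{N}(R) = 0.6$ with weights $(w_1, w_2, w_3) = (0.3, 0.4, 0.3)$. Then
$$\mathrm{WAVS} = 0.3 \times 0.5 + 0.4 \times 1.0 + 0.3 \times 0.6 = 0.73,$$
a high vulnerability score driven chiefly by the full semantic flip.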
2. Notable Instantiations
WAVS is instantiated with domain-adapted vulnerability measures and aggregation procedures, the most prominent being:
| WAVS Variant | Domain | Core Vulnerability Measure(s) |
|---|---|---|
| MWPB/SDWPB (Fakorede et al., 6 Mar 2024) | Vision (DNNs) | Margin or logit stddev (per-sample) |
| Non-uniform Minimax (Zeng et al., 2020) | Vision (DNNs) | Exponential of negative adversarial margin |
| VLM Safety (Rashid et al., 22 Feb 2025) | Vision-Language | Accuracy drops under random noise/FGSM |
| Peer Review LLM (Sahoo et al., 11 Dec 2025) | LLMs (review systems) | Score inflation, flip severity, risk label |
In all settings, WAVS provides a mechanism for principled risk-sensitive robustness diagnostics, aligning with practical attack models and domain-specific threat priorities.
3. Methodologies for Computing and Aggregating WAVS
Vision models (AT, TRADES, MART): Per-sample vulnerability is determined via either the multi-class margin
$$m(x_i) = f_{y_i}(x_i) - \max_{j \neq y_i} f_j(x_i)$$
or the standard deviation of the logits about the correct-class logit
$$\sigma(x_i) = \sqrt{\frac{1}{K} \sum_{j=1}^{K} \bigl(f_j(x_i) - f_{y_i}(x_i)\bigr)^2},$$
with per-sample perturbation budgets
$$\epsilon_i = \epsilon_0 \bigl(1 + \alpha\, v(x_i)\bigr),$$
where $v(x_i)$ maps the margin or deviation to a vulnerability weight (smaller margins indicating higher vulnerability). Dataset-level WAVS is computed as the mean of $\epsilon_i$ over all $i$ (Fakorede et al., 6 Mar 2024).
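A minimal sketch of the margin-based budget computation under these definitions; the sigmoid mapping from margin to vulnerability weight is an assumption standing in for the paper's reweighting function:

```python
import numpy as np

def per_sample_budgets(logits, labels, eps0=8/255, alpha=0.5):
    """Margin-based per-sample budgets eps_i = eps0 * (1 + alpha * v_i).

    logits: (n, K) model logits; labels: (n,) correct class indices.
    Low-margin (more vulnerable) samples receive larger budgets.
    """
    logits = np.asarray(logits, dtype=float)
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    rival = logits.copy()
    rival[np.arange(n), labels] = -np.inf
    margin = correct - rival.max(axis=1)      # multi-class margin m(x_i)
    v = 1.0 / (1.0 + np.exp(margin))          # assumed map: sigmoid(-margin)
    return eps0 * (1.0 + alpha * v)

# Dataset-level WAVS is then the mean budget:
# wavs = per_sample_budgets(logits, labels).mean()
```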
Weighted Minimax (non-uniform attacks): The adversarial risk is weighted by
$$w_i \propto \exp\bigl(-c\, m_{\mathrm{adv}}(x_i)\bigr),$$
the exponential of the negative adversarial margin (with temperature $c > 0$), and can be normalized over test points:
$$p_i = \frac{\exp\bigl(-c\, m_{\mathrm{adv}}(x_i)\bigr)}{\sum_{j} \exp\bigl(-c\, m_{\mathrm{adv}}(x_j)\bigr)},$$
yielding a per-example selection probability interpreted as WAVS (Zeng et al., 2020).
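This normalization is a softmax over negative adversarial margins, which a short sketch makes explicit (the temperature and input margins are illustrative):

```python
import numpy as np

def selection_probabilities(adv_margins, c=1.0):
    """Softmax over negative adversarial margins: low-margin (vulnerable)
    test points receive high selection probability, interpreted as WAVS."""
    z = -c * np.asarray(adv_margins, dtype=float)
    z -= z.max()                         # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Example: the lowest-margin point dominates the attacker's selection.
print(selection_probabilities([2.0, 0.1, 1.5]))  # mass concentrates on index 1
```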
Vision-LLMs: WAVS is a convex combination of the relative drop in accuracy under composite random noise versus under FGSM:
$$\mathrm{WAVS} = \lambda\, \Delta_{\mathrm{rand}} + (1 - \lambda)\, \Delta_{\mathrm{FGSM}}, \qquad \lambda \in [0, 1],$$
with each score defined as a relative decrease from the clean baseline (Rashid et al., 22 Feb 2025).
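A minimal sketch of this combination, assuming the two drops are measured in percentage points against the clean baseline (a reading consistent with the CLIP/Caltech-256 numbers reported in Section 4; $\lambda = 0.5$ gives the "balanced" score):

```python
def vlm_wavs(acc_clean: float, acc_rand: float, acc_fgsm: float,
             lam: float = 0.5) -> float:
    """Convex combination of accuracy drops under random noise and FGSM."""
    drop_rand = acc_clean - acc_rand    # composite random-noise drop
    drop_fgsm = acc_clean - acc_fgsm    # gradient-based (FGSM) drop
    return lam * drop_rand + (1.0 - lam) * drop_fgsm

# Illustrative numbers close to the CLIP/Caltech-256 results below:
print(vlm_wavs(acc_clean=95.0, acc_rand=66.0, acc_fgsm=5.0))  # -> 59.5
```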
LLM-based review systems: WAVS for each (paper, attack) pair is the normalized weighted sum of score delta, semantic-flip severity, and risk alignment, aggregated by averaging over all instances (Sahoo et al., 11 Dec 2025).
4. Empirical Insights and Evaluations
Margin/Stddev-Weighted Adversarial Training (MWPB/SDWPB): Integrating WAVS-driven per-sample budgets in adversarial training led to consistent improvements in robust accuracy across CIFAR-10, SVHN, TinyImageNet, and both white-box and black-box threats (PGD-20, CW, AA, Square, SPSA). For example, on ResNet-18/CIFAR-10:
- Standard AT: PGD-20 accuracy 52.72%
- MWPB-AT: PGD-20 accuracy 56.25% (+3.53)
- SDWPB-AT: PGD-20 accuracy 56.69% (+3.97)
WAVS directly summarized the mean perturbation intensity encountered, providing both a training control signal and an evaluation statistic (Fakorede et al., 6 Mar 2024).
Non-uniform attack resistance: Weighted minimax (WAVS-aligned) adversarial training preserved or slightly improved uniform robust accuracy while raising adversarial accuracy under non-uniform attacks by several points. For CIFAR-10 (20-step PGD):
| Defense | Uniform attack | Non-uniform attack (1) | Non-uniform attack (2) |
|---|---|---|---|
| PGD (base) | 49.29% | 11.66% | 9.72% |
| PGD+WAVS | 49.53% | 13.19% | 10.81% |
By learning to emphasize highly vulnerable examples (larger $w_i$), the defense adapts to structured, threat-adaptive adversaries (Zeng et al., 2020).
Vision-language safety: On CLIP/Caltech-256, WAVS captured the stark contrast between natural and adversarial noise, quantifying both information-theoretic and gradient-based sensitivity in a single interpretable value. With baseline accuracy of 95%, composite random noise dropped accuracy by ≈29 points, FGSM by ≈90, yielding a balanced WAVS of ≈59.8 (for $\lambda = 0.5$) (Rashid et al., 22 Feb 2025).
LLM-based review vulnerability: WAVS separated superficial score-nudging from deep semantic flips and risk, revealing consistent differences between open-source and proprietary models and across attack strategies. Top strategies against open models achieved WAVS ≈0.81, while GPT-5 maintained WAVS near zero, indicating strong defensive alignment (Sahoo et al., 11 Dec 2025).
5. Theoretical Motivation and Interpretability
WAVS addresses several deficiencies in classical adversarial vulnerability metrics:
- Non-uniform adversarial focus: WAVS quantifies not just global robustness, but the distribution of vulnerability across instances/scenarios, reflecting adversary adaptivity.
- Integration of multiple risk dimensions: Especially in LLM settings, WAVS fuses score inflation, flip risk, and ground-truth severity, yielding actionable decompositions.
- Hyperparameterized focus: The weighting scheme allows tuning according to operational risk—natural perturbations versus targeted attacks (Rashid et al., 22 Feb 2025, Sahoo et al., 11 Dec 2025).
- Direct training and diagnostic utility: In AT, WAVS surfaces as both a loss-reweighting (during training) and an evaluative summary (after training) (Fakorede et al., 6 Mar 2024, Zeng et al., 2020).
6. Practical Considerations and Limitations
Assumptions and normalization: WAVS calculations assume access to per-instance logits, risk labels, or adversarial deltas, as appropriate. Normalization choices (e.g., min–max) can be sensitive to outliers and must be applied consistently across evaluations for fair comparison (Sahoo et al., 11 Dec 2025). In peer-review settings, mislabeling of paper risk can distort WAVS.
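A two-line illustration of that outlier sensitivity (values hypothetical): a single extreme score stretches the min–max range and collapses all other normalized scores toward zero.

```python
import numpy as np

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scores = [0.10, 0.12, 0.15, 0.11]
print(min_max(scores))          # [0.  0.4 1.  0.2] -- full spread over [0, 1]
print(min_max(scores + [5.0]))  # the same four scores collapse toward 0
```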
Domain-specific weighting: The parameters governing the mix of components (e.g., $w_1$, $w_2$, $w_3$ for LLMs; $\lambda$ for VLMs) are currently heuristic but allow for alignment with domain risk profiles. Sensitivity analyses indicate that stable comparative rankings are obtained if critical components (flip severity/risk) receive at least 60% cumulative weight (Sahoo et al., 11 Dec 2025).
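A hedged sketch of this kind of sensitivity check (the weight grid, threshold, and component layout are illustrative, not the paper's protocol):

```python
import itertools
import numpy as np

def rankings_stable(per_model_components, min_critical=0.6, step=0.1):
    """Check whether model rankings by WAVS are invariant over all convex
    weightings that give the flip/risk components >= min_critical in total.

    per_model_components: model name -> normalized [inflation, flip, risk].
    """
    rankings = set()
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w1, w2 in itertools.product(grid, grid):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9 or w2 + max(w3, 0.0) < min_critical:
            continue                      # outside the simplex or threshold
        scores = {m: w1 * c[0] + w2 * c[1] + w3 * c[2]
                  for m, c in per_model_components.items()}
        rankings.add(tuple(sorted(scores, key=scores.get, reverse=True)))
    return len(rankings) == 1

# Example: two hypothetical models with fixed normalized components.
models = {"open-A": np.array([0.7, 0.9, 0.8]),
          "prop-B": np.array([0.2, 0.1, 0.1])}
print(rankings_stable(models))  # True: open-A dominates at every weighting
```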
Threat space coverage: For an adversarially comprehensive WAVS, representative attack and perturbation strategies must be included. The metric does not account for "backfire" robustness (cases where attacks inadvertently increase accuracy), which is a limitation in its current forms (Sahoo et al., 11 Dec 2025).
7. Open Questions and Future Directions
Open questions regarding WAVS include:
- Optimal weighting for novel domains: The heuristic assignment of component weights invites further theoretical and empirical justification tailored to new application areas (e.g., grant review, medical triage).
- Extension to multimodal and interactive systems: Initial WAVS variants exist for images, text, and multimodal I/O, but rigorous frameworks for video, tabular, and interactive decision-making are a prospective research direction (Sahoo et al., 11 Dec 2025).
- Defensive strategies for WAVS minimization: Identifying adversarial training regimes, regularizers, or alignment objectives that directly minimize WAVS, particularly in the context of dynamic, adaptive attackers, remains an open engineering and theoretical problem.
WAVS provides a principled, adaptable, and interpretable lens on adversarial vulnerability, enabling rigorous cross-domain robustness evaluation and targeted defense development. Its practical deployment requires careful threat modeling, control of normalization and weighting, and continued calibration against evolving adversarial paradigms.