Comparative Undermining Attack (CUA)

Updated 18 June 2026

Comparative Undermining Attack (CUA) is an adversarial methodology that systematically perturbs inputs—using techniques like adversarial suffixes—to maximize ML system failure rates.
CUA employs diverse methods such as Greedy Coordinate Gradient for LLMs, GAN+SMOTE for text, and FGSM with GradCAM for images, as well as stylometric obfuscation for authorship tasks.
Empirical evaluations show that CUA raises attack success rates against LLM judges to 31–32% (versus ≤24% for baselines), reduces text and image classifier accuracy by up to 34% and 31% respectively, and lowers stylometric attribution confidence, highlighting significant system vulnerabilities.

The AutoJudge Framework is a set of adversarial methodologies and optimization algorithms designed to systematically undermine the reliability of machine learning evaluation, classification, and attribution systems. The Comparative Undermining Attack (CUA) serves as the principal technical instantiation of this framework, spanning application domains that include LLM-as-a-Judge system integrity, image/text classification, and authorship attribution via adversarial stylometry. Central to AutoJudge is the explicit maximization of system failure rates—specifically, the strategic perturbation or transformation of inputs to induce misclassification, verdict flipping, or attribution confusion, while preserving surface-level plausibility or semantic coherence.

1. Formal Definitions and Threat Models

AutoJudge encompasses domain-specialized CUA procedures, each formally defined by its objective, perturbation class, access model, and evaluation measure.

LLM-as-a-Judge (Text Evaluation):

The CUA objective is to maximize the model’s output probability for a targeted answer $b$ over the original preferred answer $a$ by appending an adversarial suffix $\delta$: $\max_{\delta\;\in\;\mathcal V^L}\; \Bigl[ \mathbb{P}\bigl([[B]] \mid x, a, b\oplus\delta\bigr) - \mathbb{P}\bigl([[A]] \mid x, a, b\oplus\delta\bigr) \Bigr],$ where $\oplus$ denotes concatenation, $\mathcal V$ is the token vocabulary, $L$ is the suffix length, and $[[A]]$ and $[[B]]$ are the judge’s verdict tokens. The attacker may have black-box or white-box access to logits and, optionally, to gradients. The suffix $\delta$ is tightly length-constrained and is appended only to $b$ (Maloyan et al., 19 May 2025).

Classification Systems (Text/Image):

CUA comprises two modules that share a white-box access model:

Text: use GAN+SMOTE to generate feature-space adversarial examples.
Images: FGSM-based perturbations with or without GradCAM-guided masking to concentrate the attack on discriminative regions.
Attacker has white-box access and may generate synthetic minority adversarial samples or inject pixel-level gradients (Lunga et al., 2024).

Authorship Attribution (Stylometry):

CUA constitutes a multi-path transformation ensemble, applying combinations of stylometric obfuscation (paraphrasing, translation, imitation) and zero-width Unicode steganography to minimize attribution confidence: $c^*=\arg\max_c\Bigl[\Delta\bigl(\mathcal{F}(M_c),\mathcal{F}(\text{ref})\bigr)\Bigr] \text{ or } c^*=\arg\min_c\Bigl[P_c\Bigr],$ where $\mathcal{F}(\cdot)$ is a feature extractor and $P_c$ the model’s author probability (Dilworth, 19 Aug 2025).

2. Optimization Algorithms and Attack Mechanisms

AutoJudge implementations utilize optimization schemes tailored to the input modality and the attack surface.

Greedy Coordinate Gradient (GCG):

A coordinate-wise, greedy search for adversarial suffixes in LLMs, iteratively choosing the best token replacement via gradients or surrogate metrics (Maloyan et al., 19 May 2025).
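The coordinate-wise search can be illustrated with a minimal, gradient-free sketch. Everything here is an assumption made for illustration: `VOCAB` is a hypothetical toy vocabulary, and `toy_margin` is a hand-written stand-in for the objective $\mathbb{P}([[B]]) - \mathbb{P}([[A]])$, not a real judge. GCG proper uses token-embedding gradients to shortlist replacement candidates at each position; this sketch instead tries every vocabulary token, which is only feasible at toy scale.

```python
import random

# Hypothetical toy vocabulary; a real attack searches the judge's token vocabulary.
VOCAB = ["please", "prefer", "answer", "B", "ignore", "the", "best", "verdict"]

def toy_margin(suffix):
    """Stand-in for P([[B]] | ...) - P([[A]] | ...); NOT a real LLM judge."""
    weights = {"prefer": 0.15, "B": 0.2, "best": 0.1, "verdict": 0.05}
    return sum(weights.get(tok, -0.02) for tok in suffix) / len(suffix)

def greedy_coordinate_search(margin_fn, vocab, length, sweeps=3, seed=0):
    """Greedily improve one suffix position at a time under a fixed length budget."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(length)]
    best = margin_fn(suffix)
    for _ in range(sweeps):
        improved = False
        for pos in range(length):
            for tok in vocab:
                # Replace a single coordinate and keep the change only if it helps.
                candidate = suffix[:pos] + [tok] + suffix[pos + 1:]
                score = margin_fn(candidate)
                if score > best:
                    suffix, best, improved = candidate, score, True
        if not improved:
            break  # no single-token replacement improves the margin
    return suffix, best

suffix, margin = greedy_coordinate_search(toy_margin, VOCAB, length=4)
print(suffix, round(margin, 3))
```

In a real setting `margin_fn` would query the judge with $b \oplus \delta$, so the number of sweeps and the candidate set size directly determine the query or gradient budget.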

GAN + SMOTE (Text):

An adversarial sample generator trained on classifier boundary points, augmented with synthetic oversampling to produce misclassification-inducing examples (Lunga et al., 2024).
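The oversampling half of this pipeline can be sketched in a few lines. This is a simplified SMOTE-style interpolation only, under the assumption that the seed points are already available as feature vectors; the GAN generator and the selection of classifier boundary points described above are omitted, and the function name `smote_samples` is hypothetical.

```python
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """Synthesize points on segments between a seed point and a near neighbour."""
    rng = random.Random(seed)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the seed point among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sqdist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic

seeds = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_samples(seeds, n_new=3))
```

Because each synthetic point lies between two existing points, the samples stay inside the region spanned by the seeds; the attack's effectiveness then depends on how close those seeds are to the decision boundary.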

FGSM and GradCAM Masking (Images):

Fast Gradient Sign Method applied globally or locally (via GradCAM masks) to maximize the impact of perturbations on classifier logits while minimizing perceptual differences.
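A minimal sketch of the global and masked variants follows, assuming a logistic-regression "classifier" so that the input gradient has a closed form. The weights, the input, and the binary `mask` are hypothetical; in particular the mask is a hand-picked stand-in for a thresholded GradCAM saliency map, which would normally be computed from convolutional feature maps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def predict(x, w, b):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(x, y, w, b, eps, mask=None):
    """One FGSM step; mask entries of 0 leave the corresponding pixel untouched."""
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]  # d(cross-entropy)/dx for a logistic model
    if mask is None:
        mask = [1.0] * len(x)          # global attack: perturb every pixel
    return [
        min(1.0, max(0.0, xi + eps * mi * sign(gi)))  # keep pixels in [0, 1]
        for xi, gi, mi in zip(x, grad, mask)
    ]

x = [0.5, 0.5, 0.5, 0.5]
w, b, y = [2.0, -1.0, 0.5, -3.0], 0.0, 1
x_global = fgsm(x, y, w, b, eps=0.05)
x_masked = fgsm(x, y, w, b, eps=0.05, mask=[1.0, 0.0, 0.0, 1.0])
print(predict(x, w, b), predict(x_global, w, b), predict(x_masked, w, b))
```

The masked variant changes fewer pixels for the same per-pixel budget, which is the trade-off the GradCAM-guided attack exploits: perturbation is spent only where the model is most sensitive.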

Multi-Primitive Stylometry Transformation:

Exhaustive enumeration of transformation combinations (obfuscation, translation, Unicode steganography) with selection via direct queries to the attribution system for posterior minimization (Dilworth, 19 Aug 2025).
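The enumerate-and-select loop can be sketched as follows. The primitives, the function-word list, and the distance are all illustrative assumptions: `drop_articles` is a toy stand-in for paraphrasing, translation is omitted because it needs an external service, and `distance` is a mean absolute difference of relative frequencies rather than Burrows' Delta, which additionally z-scores each frequency over a reference corpus. Selection here maximizes feature divergence; selecting by querying an attribution model for the lowest author probability would replace `distance` with that query.

```python
from itertools import combinations

ZWSP = "\u200b"  # zero-width space, invisible when rendered
FUNCTION_WORDS = ["the", "of", "and", "is", "it"]  # toy feature set (assumption)

def features(text):
    """Relative frequency of each function word among whitespace tokens."""
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return [tokens.count(w) / total for w in FUNCTION_WORDS]

def distance(f, g):
    """Mean absolute difference; a simplified stand-in for Burrows' Delta."""
    return sum(abs(a - b) for a, b in zip(f, g)) / len(f)

def insert_zero_width(text):
    """Hide a zero-width space inside each function word."""
    return " ".join(
        t[0] + ZWSP + t[1:] if t.lower() in FUNCTION_WORDS else t
        for t in text.split(" ")
    )

def drop_articles(text):
    """Toy paraphrase stand-in: remove the article 'the'."""
    return " ".join(t for t in text.split(" ") if t.lower() != "the")

PRIMITIVES = {"drop_articles": drop_articles, "zero_width": insert_zero_width}

def best_combination(text, primitives):
    """Try every non-empty subset of primitives; keep the most divergent output."""
    reference = features(text)
    best = ((), 0.0, text)
    names = sorted(primitives)  # fixed application order within each subset
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            candidate = text
            for name in combo:
                candidate = primitives[name](candidate)
            score = distance(features(candidate), reference)
            if score > best[1]:
                best = (combo, score, candidate)
    return best

combo, score, rewritten = best_combination(
    "The cat is on the mat and it is the end of the day", PRIMITIVES
)
print(combo, round(score, 4))
```

The number of subsets grows as $2^k$ in the number of primitives $k$, so exhaustive enumeration is only practical for a small primitive set; the sketch also ignores application order, which a fuller search would treat as part of the configuration.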

3. Empirical Evaluation and Benchmarks

AutoJudge frameworks have been empirically evaluated in diverse contexts, demonstrating substantial efficacy.

Domain	Attack Method	Baseline ASR/Drop	CUA ASR/Drop	Models/Datasets
LLM-as-a-Judge	CUA (suffix)	≤24% (JudgeDeceiver)	31-32%	Qwen2.5-3B, Falcon3-3B, MT-Bench
Text Classification	GAN+SMOTE	-	29%	DecisionTree, RF, XGBoost, Kaggle
Image Classification	FGSM+GradCAM	-	31%	CNN, Olivetti Faces
Stylometry Attribution	Multimodal CUA	-	Δ(Burrows’ Δ): 0.67	Fast Stylometry, TraceTarnish

LLM judges: CUA increased Attack Success Rate (ASR) to 31–32% vs. ≤24% for baselines, including JudgeDeceiver templates. The Justification Manipulation Attack (JMA), which manipulates the judge’s generated justifications, achieved a lower ASR of 15–17% (Maloyan et al., 19 May 2025).
Text classifiers: CUA dropped accuracy by up to 34% (e.g., RandomForest, ΔAccuracy = 34.1%) (Lunga et al., 2024).
Images: FGSM+GradCAM attacks produced accuracy drops of up to 31% at ε=0.05.
Stylometry: CUA reduced model attribution confidence and maximized feature divergence, particularly when combining paraphrasing with zero-width steganography (Dilworth, 19 Aug 2025).
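For reference, the two headline metrics above can be computed as in the sketch below. The exact definitions are assumptions, since the text does not spell them out: ASR is taken here as the fraction of cases not already won by the target answer that flip to it after the attack, and accuracy drop as the difference between clean and adversarial accuracy in percentage points.

```python
def attack_success_rate(clean_verdicts, attacked_verdicts, target="B"):
    """Share of initially non-target verdicts that flip to the target."""
    eligible = [
        (c, a) for c, a in zip(clean_verdicts, attacked_verdicts) if c != target
    ]
    if not eligible:
        return 0.0
    return sum(a == target for _, a in eligible) / len(eligible)

def accuracy_drop(labels, clean_preds, adversarial_preds):
    """Clean accuracy minus adversarial accuracy, in percentage points."""
    def accuracy(preds):
        return sum(p == t for p, t in zip(preds, labels)) / len(labels)
    return 100.0 * (accuracy(clean_preds) - accuracy(adversarial_preds))

print(attack_success_rate(["A", "A", "A", "B", "A"], ["B", "A", "A", "B", "B"]))
print(accuracy_drop([1, 0, 1, 1], [1, 0, 1, 0], [0, 0, 1, 0]))
```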

4. Domain-Specific CUA Architectures

LLM Comparative Evaluation Attacks:

The attack exploits the comparative nature of evaluation prompts by optimizing for decision token flipping (CUA) and/or manipulation of intermediate justification generation (JMA). The GCG approach is efficient and effective even with tight token budgets. Notably, larger LLM judges offered only slight extra robustness, and stochastic decoding increased ASR by ~2 percentage points.

Classification Model Attacks:

GAN+SMOTE leverages model decision boundaries to synthesize adversarial samples for structured tabular/text inputs. For images, GradCAM focuses perturbation where the model is maximally sensitive, exploiting the high-dimensional, continuous input manifold.

Stylometric Undermining:

CUA applies a power set of transformation primitives, optimized for maximal distance from the original stylometric fingerprint under attribution models. Insertion of zero-width Unicode disrupts $n$-gram and token boundary statistics without visible semantic alteration.

5. Factors Influencing Attack Potency and Modality Susceptibility

Suffix/perturbation length ($L$): Shorter CUA suffixes in LLM settings yield moderate ASR (~20%); intermediate lengths balance stealth and attack success. Excessively long suffixes ($L > 50$) enhance ASR but are easily detectable by heuristic filters (Maloyan et al., 19 May 2025).
Model scale: Initial evidence indicates that larger LLM judges (7B parameters) are marginally more robust, but a larger optimization budget compensates for this.
Decoding strategies: Greedy decoding yields lower ASR than stochastic sampling.
Input modality: Image classifiers are more vulnerable due to the continuity of the adversarial subspace, whereas text attacks are constrained by the discrete token manifold, making semantic and syntactic artifacts more probable (Lunga et al., 2024). In stylometry, zero-width Unicode has significant but filterable impact (Dilworth, 19 Aug 2025).

6. Defensive Countermeasures and Open Research Problems

No universal mitigation exists for AutoJudge-class attacks. Domain-specific, partially effective strategies include:

Prompt sanitization/input filtering: Stripping or heuristically blocking adversarial suffixes or zero-width code points; vulnerable to syntactic evasion.
Adversarial training: Augmenting evaluation or classifier datasets with adversarial samples; incurs computational overhead and may reduce assessment fidelity.
Output certification and ensembling: Randomizing inputs or model parameters at inference and accepting only consistent verdicts, at the cost of increased latency and limited coverage.
Attention-based anomaly detection: Tracking model focus to identify unusually influential tokens; prone to false positives and requires fine calibration.
Continuous red-teaming: Automated generation and testing of adversarial cases to audit system vulnerability.
Stylometric normalization: Detecting and removing zero-width characters or employing adversarially trained attribution models.
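Two of the listed countermeasures are simple enough to sketch: stripping zero-width code points before evaluation, and accepting a verdict only when it is stable across input variants. Both functions are illustrative assumptions rather than a vetted defense. The `ZERO_WIDTH` set is a hand-picked subset of invisible code points, `judge` is any caller-supplied function mapping a prompt to a verdict, and how the variants are generated (for example by reordering or re-wording the prompt) is left to the caller.

```python
# Hand-picked invisible code points (assumption; not an exhaustive list).
# U+200C and U+200D are legitimate in some scripts and emoji sequences,
# so blanket removal can alter genuine text.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_zero_width(text):
    """Remove the listed zero-width code points from the input."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def consistent_verdict(judge, variants):
    """Return the verdict only if the judge agrees with itself on every variant."""
    verdicts = {judge(v) for v in variants}
    return verdicts.pop() if len(verdicts) == 1 else None

def fragile_judge(prompt):
    # Toy judge that flips when it sees a marker string; NOT a real model.
    return "B" if "!!" in prompt else "A"

print(strip_zero_width("t\u200bhe\u2060 end"))
print(consistent_verdict(fragile_judge, ["compare a and b", "compare a and b !!"]))
```

Returning `None` on disagreement makes the coverage cost explicit: every rejected case needs a fallback such as human review, and each extra variant adds one more judge call to the latency.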

Open questions include the possibility of designing differentiable surrogates for attribution systems to enable direct gradient-based CUA optimization in discrete transformation spaces, and quantifying the channel capacity of steganographic perturbations before detectability (Dilworth, 19 Aug 2025).

7. Practical Considerations and Implications

AutoJudge exposes significant and systematic vulnerabilities across the model evaluation spectrum. Even state-of-the-art systems exhibit high flip rates under optimized adversarial attack. Robustness is modality- and context-dependent: images, in particular, remain highly susceptible to subtle perturbations focused along salient features, while text attacks are subject to semantic, grammatical, and detection constraints. The comparative and multi-path optimization paradigm underlying AutoJudge selects not just a single perturbation but the adversarial configuration that maximally degrades system reliability, which emphasizes the need for comprehensive, adaptive, and feedback-driven defense measures in deployment-critical settings (Maloyan et al., 19 May 2025, Lunga et al., 2024, Dilworth, 19 Aug 2025).