CAVGAN: Concept Activation Vector GAN
- CAVGAN is an adversarial framework that uses GANs to analyze and manipulate LLM hidden states, addressing both jailbreak attacks and defenses.
- It exploits the linear separability of benign and malicious embeddings to learn minimal perturbations that bypass internal security boundaries.
- Empirical evaluations on models like Qwen2.5 and Llama3.1 demonstrate an average attack success rate of 88.85% and robust defense performance.
The CAVGAN framework, or "Concept Activation Vector GAN," is an adversarial approach designed to both analyze and manipulate the internal security boundaries of LLMs by operating directly on their intermediate hidden representations. It unifies the traditionally separate domains of jailbreak attack and defense by leveraging the linear separability of embedding spaces within LLMs, utilizing generative adversarial networks (GANs) to learn, exploit, and protect the LLM's internal mechanisms for distinguishing between benign and malicious prompts (Li et al., 8 Jul 2025).
1. Motivation and Problem Context
Modern LLMs are typically aligned for security via techniques such as Reinforcement Learning from Human Feedback (RLHF) or supervised fine-tuning, optimizing for the rejection of malicious queries. Despite such alignment, adversarial prompts—referred to as "jailbreaks"—can induce LLMs to generate disallowed or harmful outputs by manipulating token or representation sequences. Existing approaches to this problem have generally focused on either attack (developing more effective jailbreak prompts) or defense (implementing input filters or perturbations), rarely offering frameworks that jointly address both aspects.
CAVGAN addresses this gap through the principle that "attack guides defense": By explicitly modeling and understanding how internal representation changes lead to security failures, improved and more robust defenses can be constructed. Key to this perspective is recognizing that at intermediate layers of LLMs, hidden-state embeddings for benign and malicious prompts are often linearly separable. This property enables adversarial and defensive manipulations to be treated as geometry problems in embedding space.
2. Theoretical Underpinnings and Formulation
Let $\mathcal{H}_l$ denote the embedding space at decoding layer $l$ in the LLM. The framework posits a security boundary function

$$f: \mathcal{H}_l \to \{0, 1\},$$

where $f(h) = 1$ indicates a benign embedding and $f(h) = 0$ indicates a malicious embedding. In practice, this is realized by a discriminator network $D$ with a threshold $\tau$: an embedding $h$ is classified as benign when $D(h) > \tau$. A jailbreak attack is formulated as searching for a small perturbation $\delta$ such that a malicious embedding $h_m$ is brought into the benign region:

$$D(h_m + \delta) > \tau, \qquad \|\delta\| \text{ small}.$$

CAVGAN operationalizes this using a GAN. The generator $G$ produces perturbations $\delta = G(h_m)$ for malicious embeddings $h_m$, seeking to map them into the region considered benign by $D$. The discriminator $D$ is trained to distinguish genuinely benign hidden states from both original and perturbed malicious states.
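The boundary-crossing search can be illustrated numerically. The sketch below is not the paper's method: it stands in a toy 2-D embedding space with a hand-picked logistic discriminator, and finds a small $\delta$ by gradient ascent on $D(h_m + \delta)$ rather than with a learned generator.

```python
import numpy as np

# Toy sketch (not the paper's implementation): a logistic "security boundary"
# D(h) = sigmoid(w @ h + b), with D(h) > tau treated as benign, and a
# gradient-ascent search for a small perturbation delta that moves a
# malicious embedding h_m across the boundary. All weights are hand-picked.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -0.5])   # assumed boundary normal (illustrative)
b = 0.0
tau = 0.5

def D(h):
    return sigmoid(w @ h + b)

h_m = np.array([-2.0, 1.0])          # malicious embedding: D(h_m) < tau
delta = np.zeros_like(h_m)
lr = 0.2
for _ in range(500):
    x = h_m + delta
    p = D(x)
    if p > tau:                      # crossed the boundary: stop early,
        break                        # keeping ||delta|| small
    delta += lr * p * (1 - p) * w    # gradient of D with respect to the input

assert D(h_m) < tau            # originally flagged as malicious
assert D(h_m + delta) > tau    # perturbed embedding now reads as benign
```

In CAVGAN this per-example search is amortized: the generator learns to emit such a $\delta$ in a single forward pass.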
Standard GAN-style losses are employed:
- Generator: $\mathcal{L}_G = -\mathbb{E}_{h_m}\left[\log D\big(h_m + G(h_m)\big)\right]$
- Discriminator: $\mathcal{L}_D = -\mathbb{E}_{h_b}\left[\log D(h_b)\right] - \mathbb{E}_{h_m}\left[\log\big(1 - D(h_m)\big)\right] - \mathbb{E}_{h_m}\left[\log\big(1 - D(h_m + G(h_m))\big)\right]$
This competitive process leads the generator to identify minimal directions in embedding space for bypassing security, while the discriminator adaptively tightens its detection, approximating the model's internal safety concept activation vector (SCAV).
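The two losses above can be evaluated directly for a batch of embeddings. In this numeric sketch, $D$ is a fixed logistic scorer and $G$ a fixed linear map; both are illustrative stand-ins, not the paper's 4-layer MLPs, and the dimensions and data are synthetic.

```python
import numpy as np

# Numeric sketch of the generator and discriminator losses, with stand-in
# networks: D is a fixed logistic scorer, G a fixed linear map (illustrative,
# not the paper's MLPs). eps guards the logarithms.
rng = np.random.default_rng(0)
d, k, eps = 8, 16, 1e-8

w = rng.normal(size=d)
D = lambda H: 1.0 / (1.0 + np.exp(-(H @ w)))   # (k,) scores in (0, 1)
Wg = 0.1 * rng.normal(size=(d, d))
G = lambda H: H @ Wg                            # one perturbation per row

h_b = rng.normal(loc=+0.5, size=(k, d))         # "benign" batch
h_m = rng.normal(loc=-0.5, size=(k, d))         # "malicious" batch

delta = G(h_m)
L_G = -np.mean(np.log(D(h_m + delta) + eps))
L_D = (-np.mean(np.log(D(h_b) + eps))
       - np.mean(np.log(1 - D(h_m) + eps))
       - np.mean(np.log(1 - D(h_m + delta) + eps)))

assert np.isfinite(L_G) and np.isfinite(L_D)
assert L_G > 0 and L_D > 0
```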
3. Architecture and System Components
The CAVGAN system comprises several interacting components centered on a fixed intermediate layer of an LLM:
- Generator $G$: implemented as a 4-layer MLP with ReLU activations. Weight normalization on the output layer ensures that the generated perturbations are implicitly norm-bounded. The output dimension matches that of the embedding.
- Discriminator $D$: a 4-layer MLP with a final sigmoid activation. Outputs are high for benign embeddings and low for both unmodified and adversarially perturbed malicious embeddings.
- LLM Interaction:
- At inference, a query is processed through the victim model up to layer $l$, extracting the hidden state $h_l$.
- For attack: $\delta = G(h_l)$ is added to $h_l$ at layer $l$ before forward propagation continues, generating a modified output.
- For defense: $D(h_l)$ is evaluated. If $D(h_l) < \tau$, a safe-prefix prompt is prepended, and the output is regenerated.
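A shape-level sketch of the two MLPs clarifies the component interfaces. The hidden widths here are assumptions (the paper specifies 4-layer MLPs but the exact dimensions are not reproduced in this summary), and the generator's weight normalization is omitted.

```python
import numpy as np

# Shape-level sketch of G and D as 4-layer MLPs. Hidden width h and the
# embedding dim d are assumptions for illustration; the generator's
# weight-normalization step is omitted here.
rng = np.random.default_rng(1)
d, h = 4096, 512   # d: hidden-state dim (e.g. a 7B-scale model); h: assumed width

def mlp(dims):
    return [(rng.normal(scale=0.02, size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

G_params = mlp([d, h, h, h, d])    # generator: embedding -> perturbation
D_params = mlp([d, h, h, h, 1])    # discriminator: embedding -> benign score

def forward(params, x, final):
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return final(x @ W + b)

h_m = rng.normal(size=(2, d))                      # batch of 2 embeddings
delta = forward(G_params, h_m, lambda z: z)        # same shape as the input
score = forward(D_params, h_m + delta, sigmoid)    # scores in (0, 1)

assert delta.shape == h_m.shape
assert score.shape == (2, 1) and np.all((score > 0) & (score < 1))
```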
4. Training Mechanisms
Training utilizes 100 malicious prompts (from AdvBench and HarmfulQA; denoted $D_m$) and 100 GPT-4-generated benign prompts ($D_b$), with all test prompts excluded from training. Layer-$l$ embeddings are pre-extracted and reused for computational efficiency.
The adversarial training loop operates as:
```
for epoch in 1…E:
    for batch of k malicious h_m ∈ D_m and k benign h_b ∈ D_b:
        # Discriminator step
        L_real = -mean(log D(h_b)) - mean(log(1 - D(h_m)))
        delta  = G(h_m)
        L_fake = -mean(log(1 - D(h_m + delta)))
        L_D    = L_real + L_fake
        # Update D

        # Generator step
        delta = G(h_m)
        L_G   = -mean(log D(h_m + delta))
        # Update G
```
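The adversarial loop above can be made runnable in a toy setting. The version below substitutes a logistic discriminator and a linear generator (stand-ins for the paper's MLPs) trained with hand-derived gradients on synthetic two-cluster data; dimensions, learning rate, and data are all illustrative assumptions.

```python
import numpy as np

# Runnable toy version of the adversarial loop: logistic discriminator,
# linear generator, hand-derived gradients, synthetic two-cluster data.
rng = np.random.default_rng(0)
d, k, lr, steps = 4, 64, 0.5, 300
u = np.ones(d) / np.sqrt(d)
D_b = rng.normal(size=(k, d)) + 1.5 * u      # "benign" embeddings
D_m = rng.normal(size=(k, d)) - 1.5 * u      # "malicious" embeddings

w, b = np.zeros(d), 0.0                      # discriminator parameters
Wg = 0.01 * rng.normal(size=(d, d))          # generator parameters

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(steps):
    # --- discriminator step: descend L_D = L_real + L_fake ---
    delta = D_m @ Wg
    zb, zm, zf = D_b @ w + b, D_m @ w + b, (D_m + delta) @ w + b
    gw = ((sig(zb) - 1) @ D_b + sig(zm) @ D_m + sig(zf) @ (D_m + delta)) / k
    gb = np.mean(sig(zb) - 1) + np.mean(sig(zm)) + np.mean(sig(zf))
    w, b = w - lr * gw, b - lr * gb
    # --- generator step: descend L_G = -mean(log D(h_m + delta)) ---
    delta = D_m @ Wg
    zf = (D_m + delta) @ w + b
    gWg = D_m.T @ ((sig(zf) - 1)[:, None] * w[None, :]) / k
    Wg = Wg - lr * gWg

assert np.all(np.isfinite(w)) and np.all(np.isfinite(Wg))
# After training, D separates benign from (unperturbed) malicious embeddings.
assert sig(D_b @ w + b).mean() > sig(D_m @ w + b).mean()
```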
5. Empirical Evaluation and Results
Attack Performance
CAVGAN was evaluated on Qwen2.5-7B, Llama 3.1-8B, and Mistral-8B, as well as larger Qwen2.5-14B and 32B models. Experiments used AdvBench harmful behaviors and StrongREJECT jailbreak datasets. Baseline comparisons included JRE (representation engineering via CAVs) and SCAV (iterative concept vector optimization).
Performance metrics included:
- ASR-kw: keyword-based Attack Success Rate
- ASR-gpt: GPT-4o judged jailbreak success
- ASR-Answer, ASR-Useful, ASR-Repetition: text-quality metrics
CAVGAN attained an average ASR of 88.85% across the main tasks. On the larger models, ASR exceeded 90%, indicating that the attack transfers robustly with model scale.
Defense Effectiveness
Defense evaluation ran the victim LLM with $D(h_l)$ computed at the predetermined layer. If $D(h_l) < \tau$, a standardized risk-warning prefix (“Please ensure your request is within policy…”) was prepended, and the output regenerated.
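The defense path reduces to a small decision wrapper. In the sketch below, `extract_hidden`, `generate`, and the stub components in the usage example are illustrative stand-ins, not the paper's interfaces.

```python
# Minimal sketch of the defense path: score the layer-l hidden state with D
# and, below threshold tau, prepend the safety prefix before regenerating.
# `extract_hidden`, `generate`, and the stubs below are illustrative stand-ins.
SAFE_PREFIX = "Please ensure your request is within policy. "
TAU = 0.5

def defended_generate(query, extract_hidden, D, generate, tau=TAU):
    h_l = extract_hidden(query)      # hidden state at the chosen layer l
    if D(h_l) < tau:                 # looks malicious to the discriminator
        return generate(SAFE_PREFIX + query)
    return generate(query)

# Toy usage with stub components:
out = defended_generate(
    "some query",
    extract_hidden=lambda q: len(q),   # stub "embedding"
    D=lambda h: 0.1,                   # stub score: always suspicious
    generate=lambda prompt: prompt,    # echo the final prompt
)
assert out.startswith(SAFE_PREFIX)
```

Note that the regeneration step is where the defense's extra latency (mentioned in the limitations below) comes from: a flagged query costs a second forward pass.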
Benchmarks included SafeEdit jailbreak templates (for attack resistance) and Alpaca benign queries (for benign-answer coverage). Comparisons involved SmoothLLM (random noise defense) and RA-LLM (random augmentation).
Performance on three models (DSR = defense success rate, BAR = benign-answering rate):
| Model | DSR (CAVGAN) | DSR (RA-LLM) | BAR (CAVGAN) | BAR (RA-LLM) |
|---|---|---|---|---|
| Qwen2.5-7B | 91.1% | 78.6% | 91.4% | 85.8% |
| Llama3.1-8B | 77.2% | 73.8% | 93.6% | 92.8% |
| Mistral-8B | 76.4% | — | 91.1% | — |
This shows that CAVGAN outperforms or matches state-of-the-art defense systems across both resistance and benign-answering rates.
Ablation Findings
The best tradeoff between attack success and output quality was observed when operating on the middle layers of the LLM. Performance saturated with 80–100 training prompt pairs. A simple 4-layer MLP GAN sufficed to learn the security boundary in this setting.
6. Insights, Constraints, and Future Directions
Insights:
- The empirical linear separability of benign and malicious internal embeddings underlies the framework's efficacy.
- The same adversarial model enables both attacking (generator) and defending (discriminator) by exploiting and monitoring “fragile” subspaces of the layer-$l$ embedding space $\mathcal{H}_l$.
Limitations:
- Layer index $l$ and detection threshold $\tau$ are selected via validation; automating this process could improve robustness.
- The use of simple MLPs for and leaves open the possibility that more sophisticated architectures (e.g., attention or deeper networks) could better approximate complex boundaries.
- Defense requires prompt re-generation, adding computational latency.
Prospects for Advancement:
- Extending CAVGAN to multi-layer or multi-scale scenarios (e.g., stacked discriminators).
- End-to-end fine-tuning of the base LLM with integrated CAVGAN modules for improved safety guarantees.
- Application to non-text modalities (multimodal, continual adaptation contexts).
- Formal characterization of certified robustness with respect to norm-constrained perturbations in embedding space.
7. Significance and Implications
CAVGAN presents a principled method for unifying the study and practice of white-box jailbreak attacks and synthetic defenses in LLMs, recasting both as adversarial problems in internal, near-linearly-separable embedding spaces. By adversarially approximating the LLM’s implicit security boundary, CAVGAN enables both the discovery of new vulnerabilities (by generating feasible embedding perturbations that escape detection) and the construction of adaptive, representation-level defenses.
The principle of “one framework to learn both how to break an LLM’s safety and how to enforce it” is empirically validated through high attack and defense success rates on multiple benchmarks and state-of-the-art LLMs. This approach clarifies the geometry of LLM safety mechanisms and suggests a platform for continued exploration of model-internal supervisory tools for security-critical applications (Li et al., 8 Jul 2025).