CAVGAN: Concept Activation Vector GAN
- CAVGAN is an adversarial framework that uses GANs to analyze and manipulate LLM hidden states, addressing both jailbreak attacks and defenses.
- It exploits the linear separability of benign and malicious embeddings to learn minimal perturbations that bypass internal security boundaries.
- Empirical evaluations on models like Qwen2.5 and Llama3.1 demonstrate an average attack success rate of 88.85% and robust defense performance.
The CAVGAN framework, or "Concept Activation Vector GAN," is an adversarial approach designed to both analyze and manipulate the internal security boundaries of LLMs by operating directly on their intermediate hidden representations. It unifies the traditionally separate domains of jailbreak attack and defense by leveraging the linear separability of embedding spaces within LLMs, utilizing generative adversarial networks (GANs) to learn, exploit, and protect the LLM's internal mechanisms for distinguishing between benign and malicious prompts (Li et al., 8 Jul 2025).
1. Motivation and Problem Context
Modern LLMs are typically aligned for security via techniques such as Reinforcement Learning from Human Feedback (RLHF) or supervised fine-tuning, optimizing for the rejection of malicious queries. Despite such alignment, adversarial prompts—referred to as "jailbreaks"—can induce LLMs to generate disallowed or harmful outputs by manipulating token or representation sequences. Existing approaches to this problem have generally focused on either attack (developing more effective jailbreak prompts) or defense (implementing input filters or perturbations), rarely offering frameworks that jointly address both aspects.
CAVGAN addresses this gap through the principle that "attack guides defense": By explicitly modeling and understanding how internal representation changes lead to security failures, improved and more robust defenses can be constructed. Key to this perspective is recognizing that at intermediate layers of LLMs, hidden-state embeddings for benign and malicious prompts are often linearly separable. This property enables adversarial and defensive manipulations to be treated as geometry problems in embedding space.
2. Theoretical Underpinnings and Formulation
Let $\mathcal{H}_l$ denote the embedding space at decoding layer $l$ in the LLM. The framework posits a security boundary function

$$f: \mathcal{H}_l \to \{0, 1\},$$

where $f(h) = 1$ indicates a benign embedding and $f(h) = 0$ indicates a malicious embedding. In practice, this is realized by a discriminator network $D$ with a threshold $\tau$: an embedding $h$ is classified as benign when $D(h) > \tau$. A jailbreak attack is formulated as searching for a small perturbation $\delta$ such that a malicious embedding $h_m$ is brought into the benign region:

$$D(h_m + \delta) > \tau, \qquad \|\delta\| \text{ small}.$$

CAVGAN operationalizes this using a GAN. The generator $G$ produces perturbations $\delta = G(h_m)$ for malicious embeddings $h_m$, seeking to map them into the region considered benign by $D$. The discriminator $D$ is trained to distinguish genuinely benign hidden states from both original and perturbed malicious states.
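The boundary-crossing search can be illustrated numerically. The sketch below is not the paper's method: it stands in a toy 2-D embedding space with a hand-picked logistic discriminator, and finds a small $\delta$ by gradient ascent on $D(h_m + \delta)$ rather than with a learned generator.

```python
import numpy as np

# Toy sketch (not the paper's implementation): a logistic "security boundary"
# D(h) = sigmoid(w @ h + b), with D(h) > tau treated as benign, and a
# gradient-ascent search for a small perturbation delta that moves a
# malicious embedding h_m across the boundary. All weights are hand-picked.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -0.5])   # assumed boundary normal (illustrative)
b = 0.0
tau = 0.5

def D(h):
    return sigmoid(w @ h + b)

h_m = np.array([-2.0, 1.0])          # malicious embedding: D(h_m) < tau
delta = np.zeros_like(h_m)
lr = 0.2
for _ in range(500):
    x = h_m + delta
    p = D(x)
    if p > tau:                      # crossed the boundary: stop early,
        break                        # keeping ||delta|| small
    delta += lr * p * (1 - p) * w    # gradient of D with respect to the input

assert D(h_m) < tau            # originally flagged as malicious
assert D(h_m + delta) > tau    # perturbed embedding now reads as benign
```

In CAVGAN this per-example search is amortized: the generator learns to emit such a $\delta$ in a single forward pass.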
Standard GAN-style losses are employed:
- Generator: $\mathcal{L}_G = -\mathbb{E}_{h_m}\left[\log D\big(h_m + G(h_m)\big)\right]$
- Discriminator: $\mathcal{L}_D = -\mathbb{E}_{h_b}\left[\log D(h_b)\right] - \mathbb{E}_{h_m}\left[\log\big(1 - D(h_m)\big)\right] - \mathbb{E}_{h_m}\left[\log\big(1 - D(h_m + G(h_m))\big)\right]$
This competitive process leads the generator to identify minimal directions in embedding space for bypassing security, while the discriminator adaptively tightens its detection, approximating the model's internal safety concept activation vector (SCAV).
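The two losses above can be evaluated directly for a batch of embeddings. In this numeric sketch, $D$ is a fixed logistic scorer and $G$ a fixed linear map; both are illustrative stand-ins, not the paper's 4-layer MLPs, and the dimensions and data are synthetic.

```python
import numpy as np

# Numeric sketch of the generator and discriminator losses, with stand-in
# networks: D is a fixed logistic scorer, G a fixed linear map (illustrative,
# not the paper's MLPs). eps guards the logarithms.
rng = np.random.default_rng(0)
d, k, eps = 8, 16, 1e-8

w = rng.normal(size=d)
D = lambda H: 1.0 / (1.0 + np.exp(-(H @ w)))   # (k,) scores in (0, 1)
Wg = 0.1 * rng.normal(size=(d, d))
G = lambda H: H @ Wg                            # one perturbation per row

h_b = rng.normal(loc=+0.5, size=(k, d))         # "benign" batch
h_m = rng.normal(loc=-0.5, size=(k, d))         # "malicious" batch

delta = G(h_m)
L_G = -np.mean(np.log(D(h_m + delta) + eps))
L_D = (-np.mean(np.log(D(h_b) + eps))
       - np.mean(np.log(1 - D(h_m) + eps))
       - np.mean(np.log(1 - D(h_m + delta) + eps)))

assert np.isfinite(L_G) and np.isfinite(L_D)
assert L_G > 0 and L_D > 0
```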
3. Architecture and System Components
The CAVGAN system comprises several interacting components centered on a fixed intermediate layer of an LLM:
- Generator $G$: implemented as a 4-layer MLP with ReLU activations. Weight normalization on the output layer ensures that the generated perturbations are implicitly norm-bounded. The output dimension matches that of the embedding.
- Discriminator $D$: a 4-layer MLP with a final sigmoid activation. Outputs are high for benign embeddings and low for both unmodified and adversarially perturbed malicious embeddings.
- LLM Interaction:
- At inference, a query is processed through the victim model up to layer $l$, extracting the hidden state $h_l$.
- For attack: $\delta = G(h_l)$ is added to $h_l$ at layer $l$ before forward propagation continues, generating a modified output.
- For defense: $D(h_l)$ is evaluated. If $D(h_l) < \tau$, a safe-prefix prompt is prepended, and the output is regenerated.
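A shape-level sketch of the two MLPs clarifies the component interfaces. The hidden widths here are assumptions (the paper specifies 4-layer MLPs but the exact dimensions are not reproduced in this summary), and the generator's weight normalization is omitted.

```python
import numpy as np

# Shape-level sketch of G and D as 4-layer MLPs. Hidden width h and the
# embedding dim d are assumptions for illustration; the generator's
# weight-normalization step is omitted here.
rng = np.random.default_rng(1)
d, h = 4096, 512   # d: hidden-state dim (e.g. a 7B-scale model); h: assumed width

def mlp(dims):
    return [(rng.normal(scale=0.02, size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

G_params = mlp([d, h, h, h, d])    # generator: embedding -> perturbation
D_params = mlp([d, h, h, h, 1])    # discriminator: embedding -> benign score

def forward(params, x, final):
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return final(x @ W + b)

h_m = rng.normal(size=(2, d))                      # batch of 2 embeddings
delta = forward(G_params, h_m, lambda z: z)        # same shape as the input
score = forward(D_params, h_m + delta, sigmoid)    # scores in (0, 1)

assert delta.shape == h_m.shape
assert score.shape == (2, 1) and np.all((score > 0) & (score < 1))
```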
4. Training Mechanisms
Training utilizes 100 malicious prompts (from AdvBench and HarmfulQA; denoted $D_m$) and 100 GPT-4-generated benign prompts ($D_b$), with all test prompts excluded from training. Layer-$l$ embeddings are pre-extracted and reused for computational efficiency.
The adversarial training loop operates as:
```
for epoch in 1…E:
    for batch of k malicious h_m ∈ D_m and k benign h_b ∈ D_b:
        # Discriminator step
        L_real = -mean(log D(h_b)) - mean(log(1 - D(h_m)))
        delta  = G(h_m)
        L_fake = -mean(log(1 - D(h_m + delta)))
        L_D    = L_real + L_fake
        # Update D

        # Generator step
        delta = G(h_m)
        L_G   = -mean(log D(h_m + delta))
        # Update G
```
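The adversarial loop above can be made runnable in a toy setting. The version below substitutes a logistic discriminator and a linear generator (stand-ins for the paper's MLPs) trained with hand-derived gradients on synthetic two-cluster data; dimensions, learning rate, and data are all illustrative assumptions.

```python
import numpy as np

# Runnable toy version of the adversarial loop: logistic discriminator,
# linear generator, hand-derived gradients, synthetic two-cluster data.
rng = np.random.default_rng(0)
d, k, lr, steps = 4, 64, 0.5, 300
u = np.ones(d) / np.sqrt(d)
D_b = rng.normal(size=(k, d)) + 1.5 * u      # "benign" embeddings
D_m = rng.normal(size=(k, d)) - 1.5 * u      # "malicious" embeddings

w, b = np.zeros(d), 0.0                      # discriminator parameters
Wg = 0.01 * rng.normal(size=(d, d))          # generator parameters

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(steps):
    # --- discriminator step: descend L_D = L_real + L_fake ---
    delta = D_m @ Wg
    zb, zm, zf = D_b @ w + b, D_m @ w + b, (D_m + delta) @ w + b
    gw = ((sig(zb) - 1) @ D_b + sig(zm) @ D_m + sig(zf) @ (D_m + delta)) / k
    gb = np.mean(sig(zb) - 1) + np.mean(sig(zm)) + np.mean(sig(zf))
    w, b = w - lr * gw, b - lr * gb
    # --- generator step: descend L_G = -mean(log D(h_m + delta)) ---
    delta = D_m @ Wg
    zf = (D_m + delta) @ w + b
    gWg = D_m.T @ ((sig(zf) - 1)[:, None] * w[None, :]) / k
    Wg = Wg - lr * gWg

assert np.all(np.isfinite(w)) and np.all(np.isfinite(Wg))
# After training, D separates benign from (unperturbed) malicious embeddings.
assert sig(D_b @ w + b).mean() > sig(D_m @ w + b).mean()
```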
5. Empirical Evaluation and Results
Attack Performance
CAVGAN was evaluated on Qwen2.5-7B, Llama 3.1-8B, and Mistral-8B, as well as larger Qwen2.5-14B and 32B models. Experiments used AdvBench harmful behaviors and StrongREJECT jailbreak datasets. Baseline comparisons included JRE (representation engineering via CAVs) and SCAV (iterative concept vector optimization).
Performance metrics included:
- ASR-kw: keyword-based Attack Success Rate
- ASR-gpt: GPT-4o judged jailbreak success
- ASR-Answer, ASR-Useful, ASR-Repetition: text-quality metrics
CAVGAN attained an average ASR of 88.85% across the main tasks. On the larger models, ASR exceeded 90%, indicating that the attack transfers robustly with model scale.
Defense Effectiveness
Defense evaluation ran the victim LLM with $D(h_l)$ computed at the predetermined layer. If $D(h_l) < \tau$, a standardized risk-warning prefix (“Please ensure your request is within policy…”) was prepended, and the output regenerated.
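The defense path reduces to a small decision wrapper. In the sketch below, `extract_hidden`, `generate`, and the stub components in the usage example are illustrative stand-ins, not the paper's interfaces.

```python
# Minimal sketch of the defense path: score the layer-l hidden state with D
# and, below threshold tau, prepend the safety prefix before regenerating.
# `extract_hidden`, `generate`, and the stubs below are illustrative stand-ins.
SAFE_PREFIX = "Please ensure your request is within policy. "
TAU = 0.5

def defended_generate(query, extract_hidden, D, generate, tau=TAU):
    h_l = extract_hidden(query)      # hidden state at the chosen layer l
    if D(h_l) < tau:                 # looks malicious to the discriminator
        return generate(SAFE_PREFIX + query)
    return generate(query)

# Toy usage with stub components:
out = defended_generate(
    "some query",
    extract_hidden=lambda q: len(q),   # stub "embedding"
    D=lambda h: 0.1,                   # stub score: always suspicious
    generate=lambda prompt: prompt,    # echo the final prompt
)
assert out.startswith(SAFE_PREFIX)
```

Note that the regeneration step is where the defense's extra latency (mentioned in the limitations below) comes from: a flagged query costs a second forward pass.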
Benchmarks included SafeEdit jailbreak templates (for attack resistance) and Alpaca benign queries (for benign-answer coverage). Comparisons involved SmoothLLM (random noise defense) and RA-LLM (random augmentation).
Performance on three models (DSR = defense success rate, BAR = benign-answering rate):
| Model | DSR (CAVGAN) | DSR (RA-LLM) | BAR (CAVGAN) | BAR (RA-LLM) |
|---|---|---|---|---|
| Qwen2.5-7B | 91.1% | 78.6% | 91.4% | 85.8% |
| Llama3.1-8B | 77.2% | 73.8% | 93.6% | 92.8% |
| Mistral-8B | 76.4% | — | 91.1% | — |
This shows that CAVGAN outperforms or matches state-of-the-art defense systems across both resistance and benign-answering rates.
Ablation Findings
The best tradeoff between attack success and output quality was observed when operating on the middle layers of the LLM. Performance saturated with 80–100 training prompt pairs. A simple 4-layer MLP GAN sufficed to learn the security boundary in this setting.
6. Insights, Constraints, and Future Directions
Insights:
- The empirical linear separability of benign and malicious internal embeddings underlies the framework's efficacy.
- The same adversarial model enables both attacking (generator) and defending (discriminator) by exploiting and monitoring “fragile” subspaces of the layer-$l$ embedding space $\mathcal{H}_l$.
Limitations:
- Layer index $l$ and detection threshold $\tau$ are selected via validation; automating this process could improve robustness.
- The use of simple MLPs for and leaves open the possibility that more sophisticated architectures (e.g., attention or deeper networks) could better approximate complex boundaries.
- Defense requires prompt re-generation, adding computational latency.
Prospects for Advancement:
- Extending CAVGAN to multi-layer or multi-scale scenarios (e.g., stacked discriminators).
- End-to-end fine-tuning of the base LLM with integrated CAVGAN modules for improved safety guarantees.
- Application to non-text modalities (multimodal, continual adaptation contexts).
- Formal characterization of certified robustness with respect to norm-constrained perturbations in embedding space.
7. Significance and Implications
CAVGAN presents a principled method for unifying the study and practice of white-box jailbreak attacks and synthetic defenses in LLMs, recasting both as adversarial problems in internal, near-linearly-separable embedding spaces. By adversarially approximating the LLM’s implicit security boundary, CAVGAN enables both the discovery of new vulnerabilities (by generating feasible embedding perturbations that escape detection) and the construction of adaptive, representation-level defenses.
The principle of “one framework to learn both how to break an LLM’s safety and how to enforce it” is empirically validated through high attack and defense success rates on multiple benchmarks and state-of-the-art LLMs. This approach clarifies the geometry of LLM safety mechanisms and suggests a platform for continued exploration of model-internal supervisory tools for security-critical applications (Li et al., 8 Jul 2025).