Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 134 tok/s
Gemini 2.5 Pro 41 tok/s Pro
GPT-5 Medium 27 tok/s Pro
GPT-5 High 24 tok/s Pro
GPT-4o 102 tok/s Pro
Kimi K2 196 tok/s Pro
GPT OSS 120B 441 tok/s Pro
Claude Sonnet 4.5 37 tok/s Pro
2000 character limit reached

Egida Dataset: LLM Safety Alignment

Updated 23 September 2025
  • Egida Dataset is a comprehensive collection featuring 61,830 unsafe and jailbroken prompts across 27 safety topics and 18 attack styles.
  • It leverages Direct Preference Optimization on triplets of prompts, safe answers, and unsafe answers to reduce attack success rates by 10–30%.
  • The dataset provides a cost-effective, reproducible benchmark with both synthetic and human annotations, guiding scalable safety retrofitting of diverse LLMs.

Egida is a large-scale, multi-source dataset constructed to facilitate efficient and robust safety alignment of LLMs against jailbreaking attacks. Specifically designed for use with Direct Preference Optimization (DPO), it encompasses 27 safety topics and 18 distinct jailbreaking attack styles, providing both synthetic and human annotation to support wide-ranging evaluation and reproducibility. The dataset underpins recent advances in low-cost, high-impact LLM safety retrofitting, with extensive empirical validation on diverse model architectures and families.

1. Dataset Construction and Structure

Egida is founded on a base pool of 2,949 unsafe questions or instructions sourced from nine publicly available repositories. Each item in this base set is manually reviewed and deduplicated using methods such as MinHash to mitigate response overlap and ensure that prompts reliably elicit unsafe outputs in baseline LLMs.

Safety labeling employs a fine-grained taxonomy: each instance is annotated with one or more of 27 safety topics, reflecting nuanced categories such as “cybercrime,” “sexual crimes and erotic content,” and “hate and harassment.” For applications where a more tractable set of categories is necessary, topic labels can be aggregated into higher-level groupings as recommended in Table 2 of the source.

A unique distinguishing feature is Egida’s “jailbreak expansion.” Each base prompt is programmatically augmented across 18 jailbreaking attack styles, which are systematically collected from prior works (e.g., Chen et al., Shen et al., DeepInception) and include two additional styles engineered with Qwen-72B Chat. This process results in a dataset of 61,830 unsafe instances (comprising both base and jailbroken prompts).

To build preference pairs required for alignment, corresponding safe completions are generated using inference with two external models: Mistral 7B v0.3 and Phi 3 Small 8k. Typically, the response from Mistral is preferred due to its greater detail and coverage. Every unsafe prompt thus forms a triplet:

question,chosen (safe) answer,discarded (unsafe) answer\langle \text{question}, \text{chosen (safe) answer}, \text{discarded (unsafe) answer} \rangle

Additionally, Egida contains a human-annotated subset (Egida-HSafe), comprising 1,000 random requests labeled by multiple human evaluators, to provide empirical ground truth and to calibrate both automated and LLM-based safety judgments.

2. Alignment Methodology: Direct Preference Optimization

Egida is constructed for use in preference-based alignment, particularly Direct Preference Optimization (DPO). Unlike conventional Reinforcement Learning from Human Feedback (RLHF), which requires a trained reward model, DPO operates directly over triplets q,asafe,aunsafe\langle q, a_\text{safe}, a_\text{unsafe} \rangle, optimizing model output distributions to prefer the safe response given qq.

In the context of safety retrofitting, DPO leverages Egida’s curated triplets to align a base model’s behavior, encouraging the model to output safe completions and to reject or avoid unsafe outputs—even under adversarial prompt engineering (i.e., jailbreaking). DPO is applied post-hoc to pre-trained models, requiring only modest data and computation.

A defining technical structure for DPO training is:

q,asafe,aunsafe\langle q, a_\text{safe}, a_\text{unsafe} \rangle

where qq is an unsafe or jailbroken prompt, asafea_\text{safe} is a “safe” completion obtained from external aligned models, and aunsafea_\text{unsafe} is a (potentially) unsafe response obtained from the base target model.

3. Empirical Performance and Generalization

Following DPO alignment with Egida, models demonstrate substantial reductions in Attack Success Rate (ASR) across diverse evaluation sets. Quantitatively, aligned LLMs exhibit 10–30% lower ASR relative to their unaligned baselines. Crucially, zero-shot generalization is verified: the test split holds out both specific safety topics and attack styles, confirming that the observed gains are not due to narrow memorization but instead reflect broader resilience to previously unseen jailbreaking prompts.

The DPO-aligned models further generalize across model scales and families, with successful attack styles on these models reaching a success rate of approximately 5% on the held-out test set.

However, some model-specific phenomena emerge. For instance, larger models (e.g., Llama-3.1-70B-Instruct) tend to “over refuse”—i.e., decline even safe queries more frequently after alignment—while other families like Qwen demonstrate distinct patterns in their malleability toward safety conditioning, underscoring heterogeneity in the efficacy of DPO alignment.

4. Cost and Computational Efficiency

Egida is optimized for alignment scenarios where computational resources and annotation budgets are limited. Experimental results highlight that even with as few as 2,000 DPO training samples, models achieve most of the safety benefit. Training DPO on 8B parameter models requires approximately 7.57 minutes to 1.59 hours on a single H100 GPU (%%%%7q,asafe,aunsafe\langle q, a_\text{safe}, a_\text{unsafe} \rangle8%%%%\sim\$20$.

These results are attributed primarily to the compactness and focus of Egida—containing targeted unsafe prompts and diverse jailbreaking templates—combined with the methodological simplicity of DPO, which obviates the need for auxiliary reward models or large-scale preference collection.

5. Validation: Automated and Human Evaluation

Primary safety evaluation uses Llama-Guard-3-8B, an LLM-based “judge” model that dichotomously classifies outputs as safe or unsafe. To assess the reliability of automated grading, 1,000 randomly sampled prompts are independently annotated by three human evaluators per prompt (“safe,” “unsafe,” or “uncertain”). Llama-Guard’s agreement with human consensus (averaging 77.67% once “uncertain” responses are removed) even exceeds inter-human agreement (75.48%), suggesting high judge reliability. Topic-level and demographic breakdowns (gender of annotators) are also reported.

Post-alignment, models are additionally evaluated for residual general-purpose capability (via benchmarks including MMLU-Generative and OpenLLM Leaderboard) and for “over refusal” tendencies, to ensure that gains in safety do not unduly degrade general usability.

6. Limitations and Implications for Future Research

Limitations of the Egida methodology include pronounced family- and size-specific effects: some models rapidly become overly conservative, while others retain high rates of unsafe compliance despite DPO training. Including too many generic “safe” completions during alignment was found to dampen the positive effect; using specialist-aligned safe responses resulted in better performance, indicating nuanced dependency on choice of data.

Future research is motivated in several directions:

  • Investigation into underlying causes of model-specific malleability, related to pretraining regime and intrinsic architecture;
  • Developing methods to balance safety and performance, minimizing the risk of “over refusal;”
  • Expanding alignment to other ethical dimensions (toxicity, bias, truthfulness);
  • Iterative red-teaming and “rainbow-teaming” to address emerging vulnerabilities and generalize further to new attack vectors.

7. Access and Reproducibility

All components of Egida—including full prompt lists, safety topic label mappings, attack style definitions, synthetic safe completions, and the Egida-HSafe human-labeled subset—are released to the research community in full alignment with the goal of reproducibility and extensible safety benchmarking. Models aligned with Egida and exemplars of evaluation protocols are provided alongside comprehensive documentation.


Egida thus represents a comprehensive, well-structured, and empirically validated benchmark for preference-based safety alignment of LLMs. It operationalizes a high-density set of adversarial prompts and attack techniques, provides granular safety topic annotation, and supports both scalable automated and human-in-the-loop validation. By enabling data- and compute-efficient safety retrofitting via DPO, it provides a foundation for ongoing advances in LLM safety research, while highlighting frontier challenges in robustness and generalization (Garcia-Gasulla et al., 19 Feb 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)
Forward Email Streamline Icon: https://streamlinehq.com

Follow Topic

Get notified by email when new papers are published related to Egida Dataset.