Vicuna-moderator-7B: Safety & Moderation Insights
- Vicuna-moderator-7B informally refers to the built-in moderation behavior of Vicuna-7B v1.5, a 7B-parameter chat model derived from Llama2-7B using QLoRA and LoRA techniques, with moderation applied through a system prompt.
- The model's safety mechanism relies on that system prompt, prepended at inference and followed in context, rather than on hard-coded or separately fine-tuned refusal weights.
- Trained on around 70,000 ShareGPT conversation pairs, it demonstrates enhanced adaptability and robustness in moderating sensitive content.
Vicuna-moderator-7B is an informal name for the built-in moderation and safety behavior of the Vicuna-7B v1.5 LLM, as probed for forbidden-task robustness in "In-Context Learning Can Re-learn Forbidden Tasks" (Xhonneux et al., 2024). Vicuna-7B was derived from Llama2-7B using QLoRA and LoRA techniques on approximately 70,000 ShareGPT human–ChatGPT conversation pairs. Its moderation capability depends primarily on a system prompt prepended at inference, rather than on hard-coded or separately fine-tuned refusal weights.
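To make the idea of prompt-based moderation concrete, the sketch below assembles a single-turn prompt with a system prompt prepended at inference. It assumes the commonly documented Vicuna v1.1-style conversation layout (system text, then `USER:` and `ASSISTANT:` turns separated by spaces). The function `build_prompt` and the constant `HYPOTHETICAL_SAFETY_RULE` are illustrative names, not part of any library, and the safety sentence is an invented example rather than the wording used in the cited paper.

```python
# Minimal sketch of prompt-based moderation: the safety behavior lives in the
# text prepended to the conversation, not in dedicated refusal weights.

# Assumption: default system text of the Vicuna v1.1-style chat template.
DEFAULT_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

# Hypothetical safety instruction, invented here purely for illustration.
HYPOTHETICAL_SAFETY_RULE = "The assistant refuses requests for harmful content."


def build_prompt(user_message, system_prompt=DEFAULT_SYSTEM, safety_rule=None):
    """Return a single-turn prompt string with the system prompt prepended."""
    parts = [system_prompt]
    if safety_rule:
        # The moderation instruction is plain text appended to the system prompt.
        parts.append(safety_rule)
    system = " ".join(parts)
    # The trailing "ASSISTANT:" cues the model to generate its reply.
    return f"{system} USER: {user_message} ASSISTANT:"


if __name__ == "__main__":
    print(build_prompt("Summarize this article.", safety_rule=HYPOTHETICAL_SAFETY_RULE))
```

Because the instruction is ordinary context rather than a trained-in constraint, anything else placed in the context window (for example, in-context demonstrations) competes with it directly, which is the setting the cited paper probes.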
1. Model Origin and Safety Training Mechanism
Vicuna-7B v1.5, a 7B-parameter chat model