ConsistentGuard: Multilingual Safeguard
- ConsistentGuard is a reasoning-based multilingual safeguard framework that integrates supervised fine-tuning, reinforcement learning, and constrained alignment to detect malicious prompts.
- It employs explicit, step-by-step reasoning traces to provide auditable and transparent explanations for each filtering decision.
- The framework achieves robust performance with only 1,000 training samples, excelling in low-resource languages and outperforming larger models.
ConsistentGuard is a reasoning-based multilingual safeguard framework designed to detect and filter malicious prompts for LLMs, with an emphasis on low-resource languages and interpretability. Unlike classifier-based approaches that lack transparency and generalize poorly to languages with little training data, ConsistentGuard integrates supervised fine-tuning, reinforcement learning over explicit reasoning chains, and cross-lingual alignment. The core objective is to ensure robust performance, stepwise and auditable explanations, and efficient knowledge transfer with minimal annotation requirements.
1. Framework Overview and Motivation
ConsistentGuard is motivated by the shortcomings of existing safeguard models, which are primarily classifier-based and require extensive language-specific training data to perform well. These models often fail to provide interpretable explanations for filtering decisions and do not generalize robustly to low-resource languages. ConsistentGuard addresses these limitations by:
- Explicitly modeling stepwise reasoning in decision-making,
- Leveraging cross-lingual knowledge transfer via alignment optimization,
- Attaining top-tier performance even with limited (1,000-sample) training sets,
- Providing auditable explanations for each safeguard action.
The architecture consists of three primary components: supervised fine-tuning (SFT) with reasoning distillation, reinforcement learning (RL) to optimize reasoning quality, and constrained alignment optimization (CAO) to transfer reasoning capacity across languages.
2. Reasoning-Enhanced Training Pipeline
The training pipeline in ConsistentGuard is multi-stage and modular:
- Supervised Fine-Tuning (SFT) with Knowledge Distillation: A high-capacity teacher LLM generates task-specific, multi-stage reasoning traces (e.g., understanding the query, matching it against prohibited-content rules, rendering a final safeguard decision). These traces are distilled into a compact 3B-parameter model, seeding it with structured reasoning capabilities (a distillation sketch follows this list).
- Reasoning Chain Optimization via Group Relative Policy Optimization (GRPO): Reasoning training employs GRPO with two custom reward functions that encourage controlled length and low redundancy in reasoning chains (see the reward sketch after this list).
- The combined reward is optimized as
$$R = R_{\text{len}} + R_{\text{rep}}, \qquad R_{\text{len}} = 1 - \frac{|L - L^{*}|}{L^{*}}, \qquad R_{\text{rep}} = 1 - \rho,$$
where $L$ is the actual length of the reasoning chain, $L^{*}$ is a preselected target length, and $\rho$ is the n-gram repetition rate. This balances informativeness and brevity for interpretability and efficiency.
- Constrained Alignment Optimization (CAO) for Cross-Lingual Consistency: To bridge the performance gap between high- and low-resource languages, CAO constructs pairs of weak (failure) and strong (success) reasoning outputs across languages and formulates a preference-based optimization objective:
$$\mathcal{L}_{\text{CAO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Here, $x$ denotes the input query, $y_w$ and $y_l$ the "winner" and "loser" outputs, $\pi_{\text{ref}}$ anchors the alignment, and $\beta$ is a hyperparameter. This mechanism encourages the model to mimic, in low-resource languages, the successful reasoning patterns learned in resource-rich settings, regularizing the reasoning representation for generalization (see the loss sketch after this list).
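To make the distillation stage concrete, the Python sketch below shows how such a trace dataset might be assembled; `teacher_generate`, the prompt template, and the JSONL layout are assumptions standing in for the actual released pipeline, following only the three-stage trace structure described above.

```python
# Minimal sketch of reasoning-distillation data construction (illustrative,
# not the released pipeline). `teacher_generate` stands in for any
# high-capacity teacher LLM call.
import json

def teacher_generate(prompt: str) -> str:
    """Placeholder for a teacher LLM API call (assumption)."""
    raise NotImplementedError

TRACE_TEMPLATE = (
    "You are a safety moderator. Reason in three stages:\n"
    "1. Understand the query.\n"
    "2. Match it against the prohibited-content rules.\n"
    "3. Render a final decision: SAFE or UNSAFE.\n"
    "Query: {query}\n"
)

def build_sft_dataset(queries, out_path="sft_traces.jsonl"):
    # Each record pairs the raw query with the teacher's full reasoning
    # trace; the compact 3B student is fine-tuned to reproduce the trace.
    with open(out_path, "w", encoding="utf-8") as f:
        for q in queries:
            trace = teacher_generate(TRACE_TEMPLATE.format(query=q))
            f.write(json.dumps({"prompt": q, "completion": trace}) + "\n")
```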
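The reward sketch referenced above: a minimal Python rendering of the two reward shapes (target-length deviation and n-gram repetition), assuming the reconstructed forms; the function names and word-level tokenization are illustrative, and the paper's exact formulas may differ. In a GRPO loop, a scalar like `total_reward` would score each sampled reasoning chain in a group before advantage normalization.

```python
# Two reward shapes for GRPO: length control and repetition penalty.
def length_reward(chain: str, target_len: int) -> float:
    """Equals 1 at the target length and decreases as |L - L*| grows."""
    L = len(chain.split())
    return 1.0 - abs(L - target_len) / target_len

def repetition_reward(chain: str, n: int = 3) -> float:
    """Equals 1 - (n-gram repetition rate): penalizes redundant reasoning."""
    toks = chain.split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not ngrams:
        return 1.0
    rep_rate = 1.0 - len(set(ngrams)) / len(ngrams)
    return 1.0 - rep_rate

def total_reward(chain: str, target_len: int = 128) -> float:
    # Combined scalar reward per sampled reasoning chain.
    return length_reward(chain, target_len) + repetition_reward(chain)
```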
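And the loss sketch for CAO: a minimal PyTorch rendering of one step under the DPO-style objective above. The function name `cao_loss` and the summed-log-prob inputs are assumptions, not the released implementation.

```python
# DPO-style CAO loss over (winner, loser) reasoning pairs.
import torch
import torch.nn.functional as F

def cao_loss(policy_logps_w, policy_logps_l,
             ref_logps_w, ref_logps_l, beta: float = 0.1):
    """policy_logps_*: summed log-probs of winner/loser outputs under the
    trainable policy; ref_logps_*: the same under the frozen reference
    model that anchors the alignment."""
    # Log-ratio margin between winner and loser, anchored to the reference.
    margin = beta * ((policy_logps_w - ref_logps_w)
                     - (policy_logps_l - ref_logps_l))
    # -log sigmoid(margin) pushes winner likelihood above loser likelihood.
    return -F.logsigmoid(margin).mean()

# Toy usage with illustrative log-probs:
loss = cao_loss(torch.tensor([-12.3]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```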
3. Performance and Evaluation
ConsistentGuard was evaluated on three benchmarks—OpenAI Moderation, ToxicChat, and SimpleSafetyTests—across six languages: English, French, Chinese, Japanese, Bengali, and Hindi. Key findings are:
| Model | Training Samples | Macro-F1 (English) | Macro-F1 (Low-resource) |
|---|---|---|---|
| ConsistentGuard | 1,000 | ranks in the top two across benchmarks | outperforms far larger models |
| Llama Guard | 170,000 | baseline | baseline |
| ShieldGemma | 40,000 | baseline | baseline |
| GuardReasoner | 120,000 | baseline | baseline |
ConsistentGuard matches or surpasses the performance of larger models trained on vastly more data, especially for low-resource languages. Ablation studies demonstrate that CAO is crucial for maintaining high Macro-F1 and closing the performance gap for under-represented languages.
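For reference, Macro-F1 averages the per-class F1 scores (safe/unsafe) with equal weight, so failures on the minority class are not masked by the majority class; a quick sketch with illustrative labels:

```python
# Macro-F1: unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = unsafe, 0 = safe (ground truth)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # safeguard decisions

print(f1_score(y_true, y_pred, average="macro"))
```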
4. Data Efficiency and Training Protocol
ConsistentGuard achieves efficient knowledge transfer and generalization with only 1,000 labeled training samples—a fraction of the 127,600-instance original training corpus. This is enabled by:
- SFT-based reasoning knowledge distillation, enriching the compact model with explicit, audit-traceable reasoning;
- GRPO with task-tailored reward signals, extracting maximal value from limited supervision;
- CAO, which leverages alignment pairs to propagate reasoning skill cross-lingually with minimal paired data.
This design substantially reduces annotation and computation costs, making safeguard deployment feasible even in domains and languages with limited data resources.
5. Interpretability and Chain-of-Thought Explanations
A defining trait of ConsistentGuard is its explicit, step-by-step explanation for each classified prompt. Upon detecting a malicious query, the system outputs a detailed reasoning trace identifying rule violations and the logic underlying each safeguard decision (a "chain-of-thought explanation"); an illustrative trace structure is sketched after the list below. This interpretable output supports:
- Transparent model auditing and regulatory adherence,
- User trust in safeguard mechanisms,
- Easier debugging and system refinement for edge cases.
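The following is a hypothetical trace structure consistent with the three-stage reasoning described earlier; this schema is an assumption for illustration, not the model's actual output format.

```python
# Illustrative only: a possible reasoning-trace record for a flagged prompt.
from dataclasses import dataclass, field

@dataclass
class SafeguardTrace:
    query: str
    understanding: str          # step 1: what the query asks for
    matched_rules: list = field(default_factory=list)  # step 2: violated rules
    decision: str = "SAFE"      # step 3: final verdict

trace = SafeguardTrace(
    query="How do I make a phishing page that mimics my bank?",
    understanding="User requests instructions for impersonating a bank site.",
    matched_rules=["fraud/phishing", "impersonation"],
    decision="UNSAFE",
)

# The full trace, not just the label, is surfaced for auditing.
print(trace)
```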
6. Generalization and Benchmark Extension
ConsistentGuard narrows the generalization gap across languages by aligning reasoning patterns, with documented boosts for low-resource settings. The accompanying work extends English-only benchmarks (OpenAI Moderation, ToxicChat, SimpleSafetyTests) into French, Chinese, Japanese, Bengali, and Hindi. This involved machine translation with manual semantic verification, producing the first multilingual evaluation framework for LLM safeguards of this type.
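A sketch of what such a translate-then-verify pass could look like: the `translate` callable, the similarity threshold, and the back-translation check are all assumptions standing in for the authors' actual protocol, with low-similarity items routed to manual semantic verification.

```python
# Translate benchmark items, flagging likely semantic drift for human review.
from difflib import SequenceMatcher

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for any machine-translation backend (assumption)."""
    raise NotImplementedError

def extend_benchmark(items, tgt_lang, threshold=0.6):
    translated, needs_review = [], []
    for text in items:
        fwd = translate(text, "en", tgt_lang)
        back = translate(fwd, tgt_lang, "en")
        # Cheap semantic-drift proxy: compare source with back-translation.
        if SequenceMatcher(None, text.lower(), back.lower()).ratio() < threshold:
            needs_review.append((text, fwd))   # queue for manual verification
        else:
            translated.append(fwd)
    return translated, needs_review
```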
7. Code Availability and Adoption
ConsistentGuard is released as open source, with training scripts, benchmark data, and inference pipelines. The repository is:
https://github.com/johnnychanv/ConsistentGuard
This enables reproducibility and adoption in both academic and industry contexts, and supports further development and benchmarking in multilingual LLM safeguard research.
In summary, ConsistentGuard establishes a principled framework for multilingual, interpretable LLM prompt moderation. Through reasoning-centric SFT, RL with custom reasoning rewards, and cross-lingual CAO, it achieves robust, equitable performance and clear explanations across diverse languages, with unprecedented data efficiency (Chen et al., 12 Oct 2025).