
ConsistentGuard: Multilingual Safeguard

Updated 19 October 2025
  • ConsistentGuard is a reasoning-based multilingual safeguard framework that integrates supervised fine-tuning, reinforcement learning, and constrained alignment to detect malicious prompts.
  • It employs explicit, step-by-step reasoning traces to provide auditable and transparent explanations for each filtering decision.
  • The framework achieves robust performance with only 1,000 training samples, excelling in low-resource languages and outperforming larger models.

ConsistentGuard is a reasoning-based multilingual safeguard framework designed to detect and filter malicious prompts for LLMs, with an emphasis on low-resource languages and interpretability. Unlike classifier-based approaches that lack transparency and generalize poorly to languages with little training data, ConsistentGuard integrates supervised fine-tuning, explicit reasoning learning, and cross-lingual alignment. The core objective is to ensure robust performance, stepwise and auditable explanations, and efficient knowledge transfer with minimal annotation requirements.

1. Framework Overview and Motivation

ConsistentGuard is motivated by the shortcomings of existing safeguard models, which are primarily classifier-based and require extensive language-specific training data to perform well. These models often fail to provide interpretable explanations for filtering decisions and do not generalize robustly to low-resource languages. ConsistentGuard addresses these limitations by:

  • Explicitly modeling stepwise reasoning in decision-making,
  • Leveraging cross-lingual knowledge transfer via alignment optimization,
  • Attaining top-tier performance even with limited (1,000-sample) training sets,
  • Providing auditable explanations for each safeguard action.

The architecture consists of three primary components: supervised fine-tuning (SFT) with reasoning distillation, reinforcement learning (RL) to optimize reasoning quality, and constrained alignment optimization (CAO) to transfer reasoning capacity across languages.

2. Reasoning-Enhanced Training Pipeline

The training pipeline in ConsistentGuard is multi-stage and modular:

  • Supervised Fine-Tuning (SFT) with Knowledge Distillation: A high-capacity teacher LLM generates task-specific, multi-stage reasoning traces (e.g., understanding the query, matching to prohibitive rules, rendering a final safeguard decision). These traces are distilled into a compact 3B-parameter model, seeding it with structured reasoning capabilities.
  • Reasoning Chain Optimization via Group Relative Policy Optimization (GRPO): Reasoning training employs GRPO with two custom reward functions that encourage controlled length and low redundancy in reasoning chains.

    • The reward is optimized as

    $$ r = \sin\left(\frac{L}{2L_{best}}\right) + \left[ \sin\left(\frac{p-2}{2}\pi\right) + 1 \right] $$

    where $L$ is the actual length of the reasoning chain, $L_{best}$ is a preselected optimal length, and $p$ is the n-gram repetition rate. This balances informativeness against brevity, for interpretability and efficiency.
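    As a concrete illustration, the reward above can be computed directly. This is a minimal sketch; the exact n-gram statistic and tokenization are not specified in the source, so `ngram_repetition_rate` is an assumed definition:

    ```python
    import math

    def ngram_repetition_rate(tokens, n=2):
        """Fraction of repeated n-grams in a token sequence (illustrative
        definition; the source does not specify the exact statistic)."""
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if not ngrams:
            return 0.0
        return 1.0 - len(set(ngrams)) / len(ngrams)

    def reasoning_reward(length, best_length, repetition_rate):
        """r = sin(L / 2L_best) + [sin((p - 2)/2 * pi) + 1].

        The first term rewards chains approaching the preselected optimal
        length; the bracketed term equals 1 at zero repetition and decays
        to 0 as the n-gram repetition rate p approaches 1."""
        length_term = math.sin(length / (2.0 * best_length))
        redundancy_term = math.sin((repetition_rate - 2.0) / 2.0 * math.pi) + 1.0
        return length_term + redundancy_term
    ```

    Note that the redundancy term alone spans [0, 1]: a fully non-repetitive chain ($p=0$) earns the full bonus, while a maximally repetitive one ($p=1$) earns none.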

  • Constrained Alignment Optimization (CAO) for Cross-Lingual Consistency: To bridge the gap in performance between high- and low-resource languages, CAO is introduced. CAO constructs pairs of weak (failure) and strong (success) reasoning outputs across languages, formulating an optimization objective:

    $$ \mathcal{L}_{\text{CAO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(q, p_w, p_l)\sim\mathcal{D}} \left[ \log\sigma \left( \beta \log \frac{\pi_\theta(p_w|q)}{\pi_{\text{ref}}(p_w|q)} - \beta \log \frac{\pi_\theta(p_l|q)}{\pi_{\text{ref}}(p_l|q)} \right) \right] $$

    $$ \mathcal{L}_c = D_{\text{KL}}\left[\pi_\theta(q_a \oplus p_a) \,\|\, \pi_{\text{ref}}(q_a \oplus p_a)\right] $$

    $$ \mathcal{L} = \mathcal{L}_{\text{CAO}} + \mathcal{L}_c $$

    Here, $q$ denotes the input query, $p_w$ and $p_l$ the "winner" and "loser" outputs, $q_a \oplus p_a$ the concatenated anchor query–output pairs, and $\beta$ is a hyperparameter. This mechanism encourages the model to mimic, in low-resource languages, the successful reasoning patterns learned in resource-rich settings, regularizing the reasoning representation for cross-lingual generalization.
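The preference half of the objective, $\mathcal{L}_{\text{CAO}}$, can be sketched per training triple as follows (a simplified scalar version assuming sequence-level log-probabilities are already available; in practice the anchor KL term $\mathcal{L}_c$ would be computed over token distributions and added):

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def cao_preference_loss(logp_w, logp_w_ref, logp_l, logp_l_ref, beta=0.1):
    """Per-triple CAO term: -log sigma(beta * winner log-ratio - beta * loser log-ratio).

    Inputs are sequence log-probabilities of the winner (p_w) and loser (p_l)
    outputs under the trained policy and the frozen reference model."""
    margin = beta * (logp_w - logp_w_ref) - beta * (logp_l - logp_l_ref)
    return -math.log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss sits at $\log 2$; the loss falls as the policy raises the winner's likelihood relative to the reference, and rises if the loser gains instead.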

3. Performance and Evaluation

ConsistentGuard was evaluated on three benchmarks—OpenAI Moderation, ToxicChat, and SimpleSafetyTests—across six languages: English, French, Chinese, Japanese, Bengali, and Hindi. Key findings are:

| Model           | Training Samples | Macro-F1 (English) | Macro-F1 (Low-resource)   |
|-----------------|------------------|--------------------|---------------------------|
| ConsistentGuard | 1,000            | second/top rank    | outperforms larger models |
| Llama Guard     | 170,000          | baseline           | baseline                  |
| ShieldGemma     | 40,000           | baseline           | baseline                  |
| GuardReasoner   | 120,000          | baseline           | baseline                  |

ConsistentGuard matches or surpasses the performance of larger models trained on vastly more data, especially for low-resource languages. Ablation studies demonstrate that CAO is crucial for maintaining high Macro-F1 and closing the performance gap for under-represented languages.
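Macro-F1, the metric reported above, averages per-class F1 without class weighting, so the (typically rarer) malicious class counts as much as the benign class. A quick reference implementation:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)
```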

4. Data Efficiency and Training Protocol

ConsistentGuard achieves efficient knowledge transfer and generalization with only 1,000 labeled training samples—a fraction of the 127,600-instance original training corpus. This is enabled by:

  • SFT-based reasoning knowledge distillation, enriching the compact model with explicit, audit-traceable reasoning;
  • GRPO with task-tailored reward signals, extracting maximal value from limited supervision;
  • CAO, which leverages alignment pairs to propagate reasoning skill cross-lingually with minimal paired data.

This design substantially reduces annotation and computation costs, making safeguard deployment feasible even in domains and languages with limited data resources.

5. Interpretability and Chain-of-Thought Explanations

A defining trait of ConsistentGuard is its explicit, step-by-step explanation for each classified prompt. Upon detection of a malicious query, the system outputs a detailed reasoning trace—identifying rule violations and the logic underlying each safeguard decision ("chain-of-thought explanation"). This interpretable output supports:

  • Transparent model auditing and regulatory adherence,
  • User trust in safeguard mechanisms,
  • Easier debugging and system refinement for edge cases.
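Because each decision carries a structured trace, auditing can be mechanized. The schema below is a hypothetical illustration (the paper's actual output format may differ), with stages mirroring the distilled reasoning steps: query understanding, rule matching, and the final decision:

```python
# Hypothetical trace schema (an assumption for illustration; the actual
# output format of ConsistentGuard may differ).
REQUIRED_STAGES = ("understanding", "rule_matching", "decision")

def audit_trace(trace):
    """Check that a safeguard explanation contains every reasoning stage
    and a valid final label, so each decision is mechanically auditable."""
    if any(stage not in trace for stage in REQUIRED_STAGES):
        return False
    return trace["decision"] in {"safe", "unsafe"}
```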

6. Generalization and Benchmark Extension

ConsistentGuard narrows the generalization gap across languages by aligning reasoning patterns, with documented boosts for low-resource settings. The accompanying work extends English-only benchmarks (OpenAI Moderation, ToxicChat, SimpleSafetyTests) into French, Chinese, Japanese, Bengali, and Hindi. This involved machine translation with manual semantic verification, producing the first multilingual evaluation framework for LLM safeguards of this type.

7. Code Availability and Adoption

ConsistentGuard is released as open source, with training scripts, benchmark data, and inference pipelines. The repository is:

https://github.com/johnnychanv/ConsistentGuard

This enables reproducibility and adoption in both academic and industry contexts, and supports further development and benchmarking in multilingual LLM safeguard research.


In summary, ConsistentGuard establishes a principled framework for multilingual, interpretable LLM prompt moderation. Through reasoning-centric SFT, RL with custom reasoning rewards, and cross-lingual CAO, it achieves robust, equitable performance and clear explanations across diverse languages, with unprecedented data efficiency (Chen et al., 12 Oct 2025).

