ConsistentGuard: Multilingual Safeguard
- ConsistentGuard is a reasoning-based multilingual safeguard framework that integrates supervised fine-tuning, reinforcement learning, and constrained alignment to detect malicious prompts.
- It employs explicit, step-by-step reasoning traces to provide auditable and transparent explanations for each filtering decision.
- The framework achieves robust performance with only 1,000 training samples, excelling in low-resource languages and outperforming larger models.
ConsistentGuard is a reasoning-based multilingual safeguard framework designed to detect and filter malicious prompts for LLMs, with an emphasis on low-resource languages and interpretability. Unlike classifier-based approaches that lack transparency and generalize poorly to languages with little training data, ConsistentGuard integrates supervised fine-tuning, reinforcement learning over explicit reasoning chains, and cross-lingual alignment. The core objective is to ensure robust performance, stepwise and auditable explanations, and efficient knowledge transfer with minimal annotation requirements.
1. Framework Overview and Motivation
ConsistentGuard is motivated by the shortcomings of existing safeguard models, which are primarily classifier-based and require extensive language-specific training data to perform well. These models often fail to provide interpretable explanations for filtering decisions and do not generalize robustly to low-resource languages. ConsistentGuard addresses these limitations by:
- Explicitly modeling stepwise reasoning in decision-making,
- Leveraging cross-lingual knowledge transfer via alignment optimization,
- Attaining top-tier performance even with limited (1,000-sample) training sets,
- Providing auditable explanations for each safeguard action.
The architecture consists of three primary components: supervised fine-tuning (SFT) with reasoning distillation, reinforcement learning (RL) to optimize reasoning quality, and constrained alignment optimization (CAO) to transfer reasoning capacity across languages.
2. Reasoning-Enhanced Training Pipeline
The training pipeline in ConsistentGuard is multi-stage and modular:
- Supervised Fine-Tuning (SFT) with Knowledge Distillation: A high-capacity teacher LLM generates task-specific, multi-stage reasoning traces (e.g., understanding the query, matching it against prohibited-content rules, rendering a final safeguard decision). These traces are distilled into a compact 3B-parameter model, seeding it with structured reasoning capabilities (a distillation sketch follows this list).
- Reasoning Chain Optimization via Group Relative Policy Optimization (GRPO): Reasoning training employs GRPO with two custom reward functions that encourage controlled length and low redundancy in reasoning chains (see the reward sketch after this list).
- The combined reward is optimized as
$$R = R_{\text{len}} + R_{\text{rep}}, \qquad R_{\text{len}} = 1 - \frac{|L - L^{*}|}{L^{*}}, \qquad R_{\text{rep}} = 1 - \rho,$$
where $L$ is the actual length of the reasoning chain, $L^{*}$ is a preselected target length, and $\rho$ is the n-gram repetition rate. This balances informativeness and brevity for interpretability and efficiency.
- Constrained Alignment Optimization (CAO) for Cross-Lingual Consistency: To bridge the performance gap between high- and low-resource languages, CAO constructs pairs of weak (failure) and strong (success) reasoning outputs across languages and formulates a preference-based optimization objective:
$$\mathcal{L}_{\text{CAO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Here, $x$ denotes the input query, $y_w$ and $y_l$ the "winner" and "loser" outputs, $\pi_{\text{ref}}$ anchors the alignment, and $\beta$ is a hyperparameter. This mechanism encourages the model to mimic, in low-resource languages, the successful reasoning patterns learned in resource-rich settings, regularizing the reasoning representation for generalization (see the loss sketch after this list).
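To make the distillation stage concrete, the Python sketch below shows how such a trace dataset might be assembled; `teacher_generate`, the prompt template, and the JSONL layout are assumptions standing in for the actual released pipeline, following only the three-stage trace structure described above.

```python
# Minimal sketch of reasoning-distillation data construction (illustrative,
# not the released pipeline). `teacher_generate` stands in for any
# high-capacity teacher LLM call.
import json

def teacher_generate(prompt: str) -> str:
    """Placeholder for a teacher LLM API call (assumption)."""
    raise NotImplementedError

TRACE_TEMPLATE = (
    "You are a safety moderator. Reason in three stages:\n"
    "1. Understand the query.\n"
    "2. Match it against the prohibited-content rules.\n"
    "3. Render a final decision: SAFE or UNSAFE.\n"
    "Query: {query}\n"
)

def build_sft_dataset(queries, out_path="sft_traces.jsonl"):
    # Each record pairs the raw query with the teacher's full reasoning
    # trace; the compact 3B student is fine-tuned to reproduce the trace.
    with open(out_path, "w", encoding="utf-8") as f:
        for q in queries:
            trace = teacher_generate(TRACE_TEMPLATE.format(query=q))
            f.write(json.dumps({"prompt": q, "completion": trace}) + "\n")
```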
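The reward sketch referenced above: a minimal Python rendering of the two reward shapes (target-length deviation and n-gram repetition), assuming the reconstructed forms; the function names and word-level tokenization are illustrative, and the paper's exact formulas may differ. In a GRPO loop, a scalar like `total_reward` would score each sampled reasoning chain in a group before advantage normalization.

```python
# Two reward shapes for GRPO: length control and repetition penalty.
def length_reward(chain: str, target_len: int) -> float:
    """Equals 1 at the target length and decreases as |L - L*| grows."""
    L = len(chain.split())
    return 1.0 - abs(L - target_len) / target_len

def repetition_reward(chain: str, n: int = 3) -> float:
    """Equals 1 - (n-gram repetition rate): penalizes redundant reasoning."""
    toks = chain.split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not ngrams:
        return 1.0
    rep_rate = 1.0 - len(set(ngrams)) / len(ngrams)
    return 1.0 - rep_rate

def total_reward(chain: str, target_len: int = 128) -> float:
    # Combined scalar reward per sampled reasoning chain.
    return length_reward(chain, target_len) + repetition_reward(chain)
```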
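And the loss sketch for CAO: a minimal PyTorch rendering of one step under the DPO-style objective above. The function name `cao_loss` and the summed-log-prob inputs are assumptions, not the released implementation.

```python
# DPO-style CAO loss over (winner, loser) reasoning pairs.
import torch
import torch.nn.functional as F

def cao_loss(policy_logps_w, policy_logps_l,
             ref_logps_w, ref_logps_l, beta: float = 0.1):
    """policy_logps_*: summed log-probs of winner/loser outputs under the
    trainable policy; ref_logps_*: the same under the frozen reference
    model that anchors the alignment."""
    # Log-ratio margin between winner and loser, anchored to the reference.
    margin = beta * ((policy_logps_w - ref_logps_w)
                     - (policy_logps_l - ref_logps_l))
    # -log sigmoid(margin) pushes winner likelihood above loser likelihood.
    return -F.logsigmoid(margin).mean()

# Toy usage with illustrative log-probs:
loss = cao_loss(torch.tensor([-12.3]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```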
3. Performance and Evaluation
ConsistentGuard was evaluated on three benchmarks—OpenAI Moderation, ToxicChat, and SimpleSafetyTests—across six languages: English, French, Chinese, Japanese, Bengali, and Hindi. Key findings are:
| Model | Training Samples | Macro-F1 (English) | Macro-F1 (Low-resource) |
|---|---|---|---|
| ConsistentGuard | 1,000 | ranks in the top two across benchmarks | outperforms far larger models |
| Llama Guard | 170,000 | baseline | baseline |
| ShieldGemma | 40,000 | baseline | baseline |
| GuardReasoner | 120,000 | baseline | baseline |
ConsistentGuard matches or surpasses the performance of larger models trained on vastly more data, especially for low-resource languages. Ablation studies demonstrate that CAO is crucial for maintaining high Macro-F1 and closing the performance gap for under-represented languages.
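For reference, Macro-F1 averages the per-class F1 scores (safe/unsafe) with equal weight, so failures on the minority class are not masked by the majority class; a quick sketch with illustrative labels:

```python
# Macro-F1: unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = unsafe, 0 = safe (ground truth)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # safeguard decisions

print(f1_score(y_true, y_pred, average="macro"))
```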
4. Data Efficiency and Training Protocol
ConsistentGuard achieves efficient knowledge transfer and generalization with only 1,000 labeled training samples—a fraction of the 127,600-instance original training corpus. This is enabled by:
- SFT-based reasoning knowledge distillation, enriching the compact model with explicit, audit-traceable reasoning;
- GRPO with task-tailored reward signals, extracting maximal value from limited supervision;
- CAO, which leverages alignment pairs to propagate reasoning skill cross-lingually with minimal paired data.
This design substantially reduces annotation and computation costs, making safeguard deployment feasible even in domains and languages with limited data resources.
5. Interpretability and Chain-of-Thought Explanations
A defining trait of ConsistentGuard is its explicit, step-by-step explanation for each classified prompt. Upon detecting a malicious query, the system outputs a detailed reasoning trace identifying rule violations and the logic underlying each safeguard decision (a "chain-of-thought explanation"); an illustrative trace structure is sketched after the list below. This interpretable output supports:
- Transparent model auditing and regulatory adherence,
- User trust in safeguard mechanisms,
- Easier debugging and system refinement for edge cases.
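The following is a hypothetical trace structure consistent with the three-stage reasoning described earlier; this schema is an assumption for illustration, not the model's actual output format.

```python
# Illustrative only: a possible reasoning-trace record for a flagged prompt.
from dataclasses import dataclass, field

@dataclass
class SafeguardTrace:
    query: str
    understanding: str          # step 1: what the query asks for
    matched_rules: list = field(default_factory=list)  # step 2: violated rules
    decision: str = "SAFE"      # step 3: final verdict

trace = SafeguardTrace(
    query="How do I make a phishing page that mimics my bank?",
    understanding="User requests instructions for impersonating a bank site.",
    matched_rules=["fraud/phishing", "impersonation"],
    decision="UNSAFE",
)

# The full trace, not just the label, is surfaced for auditing.
print(trace)
```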
6. Generalization and Benchmark Extension
ConsistentGuard narrows the generalization gap across languages by aligning reasoning patterns, with documented boosts for low-resource settings. The accompanying work extends English-only benchmarks (OpenAI Moderation, ToxicChat, SimpleSafetyTests) into French, Chinese, Japanese, Bengali, and Hindi. This involved machine translation with manual semantic verification, producing the first multilingual evaluation framework for LLM safeguards of this type.
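A sketch of what such a translate-then-verify pass could look like: the `translate` callable, the similarity threshold, and the back-translation check are all assumptions standing in for the authors' actual protocol, with low-similarity items routed to manual semantic verification.

```python
# Translate benchmark items, flagging likely semantic drift for human review.
from difflib import SequenceMatcher

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for any machine-translation backend (assumption)."""
    raise NotImplementedError

def extend_benchmark(items, tgt_lang, threshold=0.6):
    translated, needs_review = [], []
    for text in items:
        fwd = translate(text, "en", tgt_lang)
        back = translate(fwd, tgt_lang, "en")
        # Cheap semantic-drift proxy: compare source with back-translation.
        if SequenceMatcher(None, text.lower(), back.lower()).ratio() < threshold:
            needs_review.append((text, fwd))   # queue for manual verification
        else:
            translated.append(fwd)
    return translated, needs_review
```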
7. Code Availability and Adoption
ConsistentGuard is released as open source, with training scripts, benchmark data, and inference pipelines. The repository is:
https://github.com/johnnychanv/ConsistentGuard
This enables reproducibility and adoption in both academic and industry contexts, and supports further development and benchmarking in multilingual LLM safeguard research.
In summary, ConsistentGuard establishes a principled framework for multilingual, interpretable LLM prompt moderation. Through reasoning-centric SFT, RL with custom reasoning rewards, and cross-lingual CAO, it achieves robust, equitable performance and clear explanations across diverse languages, with unprecedented data efficiency (Chen et al., 12 Oct 2025).