Constitutional Classifiers++ (CC++)

Updated 10 January 2026
  • The paper introduces CC++ as a production-grade defense system for LLMs, achieving a 40× reduction in computational cost and a 0.05% refusal rate.
  • Exchange-based context evaluation and a two-stage classifier cascade are employed to robustly mitigate universal jailbreaks in real-time deployments.
  • Efficient linear probe classifiers combined with ensembling strategies ensure low inference overhead while maintaining high security and red-teaming resilience.

Constitutional Classifiers++ (CC++) is an end-to-end defense framework for LLMs that robustly mitigates universal jailbreaks while achieving low computational cost and refusal rates. CC++ extends previous Constitutional Classifier architectures by introducing exchange-based context evaluation, classifier cascades, and linear probes with ensembling, resulting in a system capable of production-grade deployment with a 40× reduction in computational overhead and a refusal rate of 0.05% on live production traffic (Cunningham et al., 8 Jan 2026).

1. Objectives and Conceptual Foundations

CC++ is architected to address critical deployment requirements for LLMs:

  • Resistance to universal jailbreaks, i.e., attack strategies that reliably bypass safeguards across a diverse set of target queries (five to eight or more).
  • Maintenance of an exceptionally low false-positive refusal rate when handling benign user traffic.
  • Enforcement of minimal inference overhead to ensure efficient production-scale operation.

These objectives are realized through three principal innovations: (a) exchange classifiers operating over conversational context, (b) a two-stage classifier cascade optimizing cost and precision, and (c) efficient linear probe classifiers on model activations, ensembled with autonomous external classifiers for enhanced robustness and efficiency (Cunningham et al., 8 Jan 2026).

2. Exchange Classifier Formalism

CC++ introduces the exchange classifier, which evaluates entire conversational sequences rather than single utterances. Define an exchange $E = (U, V)$, with $U = (u_1, \ldots, u_m)$ the sequence of user messages and $V = (v_1, \ldots, v_n)$ the generated model responses.

A fine-tuned LLM, such as Claude Haiku 3.5–4.5, implements the exchange classifier $f_{\text{exchange}}$ by scoring the sequence as:

$$s(E) = f_{\text{exchange}}(E) \in \mathbb{R}$$

where $s(E)$ is interpreted as a logit for the refusal probability,

$$p_{\text{refuse}}(E) = \sigma(s(E)) = \frac{1}{1 + e^{-s(E)}}$$

Let $\phi_\ell(t; E) \in \mathbb{R}^d$ denote the model activations at layer $\ell$ and token $t$ when processing $[U; V]$. The classifier can aggregate activations across probed layers:

$$\Psi(E) = [\phi_{\ell_1}(t^*; E);\, \ldots;\, \phi_{\ell_k}(t^*; E)]$$

and compute logits $s(E) = W^\top \Psi(E) + b$, where $t^*$ indexes the last or highest-scoring token. This formalism enables context-aware evaluation, addressing the limitations of previous input-only or output-only classifiers (Cunningham et al., 8 Jan 2026).
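
This linear-head formulation can be sketched in a few lines. The dimensions, random activations, and probe weights below are illustrative placeholders, not values from the paper:

```python
import math
import random

random.seed(0)
d, k = 8, 3  # hidden size per probed layer, number of layers (illustrative)

def refusal_probability(phi_layers, W, b):
    """Concatenate activations at the chosen token t* across probed layers
    (Psi(E)), apply the linear head s(E) = W . Psi(E) + b, then the logistic
    sigmoid to obtain p_refuse(E)."""
    psi = [x for layer in phi_layers for x in layer]   # Psi(E), length k * d
    s = sum(w * x for w, x in zip(W, psi)) + b         # logit s(E)
    return 1.0 / (1.0 + math.exp(-s))

# Placeholder activations and probe weights (illustrative, not from the paper).
phi_layers = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
W = [random.gauss(0, 0.1) for _ in range(k * d)]
p = refusal_probability(phi_layers, W, b=0.0)
```

The sigmoid guarantees a well-formed probability regardless of the logit's magnitude.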

3. Two-Stage Classifier Cascade

Cost efficiency is achieved by cascading two classifiers: a lightweight first-stage exchange classifier $f_1(E)$, approximately one-third the size of the full model, and a full-capacity second-stage classifier $f_2(E)$ (e.g., Haiku 4.5). Decision thresholds $\tau_1$ and $\tau_2$ are calibrated to minimize false positives on production data.

The cascade operates as follows:

$$\text{Decision}(E) = \begin{cases} \text{Accept} & f_1(E) < \tau_1 \\ \text{Accept} & f_1(E) \ge \tau_1 \wedge f_2(E) < \tau_2 \\ \text{Refuse} & f_1(E) \ge \tau_1 \wedge f_2(E) \ge \tau_2 \end{cases}$$

The first stage flags approximately 5.5% of queries for escalation, while the second stage delivers final precision. This stratified approach ensures a refusal rate of approximately 0.05% end-to-end, substantially reducing unnecessary refusals relative to single-stage or prior systems (Cunningham et al., 8 Jan 2026).
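
The decision rule maps directly onto a short function. The second stage is a full LLM classifier in practice; `suspicious_score` below is a toy stand-in for illustration only:

```python
def cascade_decision(f1_logit, f2, exchange, tau1, tau2):
    """Two-stage cascade: a cheap first-stage score clears most traffic,
    and only escalated exchanges (~5.5% in production) pay for the
    full-capacity second-stage classifier."""
    if f1_logit < tau1:
        return "accept"              # stage 1 clears the exchange outright
    if f2(exchange) < tau2:          # stage 2 runs only on escalation
        return "accept"
    return "refuse"

# Toy stand-in for the second-stage scorer (illustrative only).
suspicious_score = lambda e: 1.0 if "forbidden" in e else -1.0
```

Only exchanges that both stages flag are refused, which is what keeps the end-to-end refusal rate low.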

4. Linear Probe Classifiers and Ensemble Strategy

CC++ employs linear probe classifiers atop pre-computed transformer activations. At each generation step $t$, probe features $\psi_t = \phi_\ell(t)$ (single-layer or concatenated multi-layer activations) are projected as $z_t = W^\top \psi_t + b$. Sliding-window averaging (window $M \approx 16$ tokens) smooths the logits, later replaced at inference with an exponential moving average (EMA) to limit memory usage.
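
The two smoothing schemes can be compared in a few lines. The EMA coefficient choice $\alpha = 2/(M+1)$ is a common window-matching heuristic, not a value from the paper:

```python
from collections import deque

def sliding_mean(logits, M=16):
    """Training-time smoothing: mean of the most recent M per-token logits
    (requires storing up to M values)."""
    window, out = deque(maxlen=M), []
    for z in logits:
        window.append(z)
        out.append(sum(window) / len(window))
    return out

def ema_smooth(logits, M=16):
    """Inference-time replacement: exponential moving average with O(1)
    memory. alpha = 2 / (M + 1) is a common window-matching heuristic."""
    alpha, out, z_bar = 2.0 / (M + 1), [], None
    for z in logits:
        z_bar = z if z_bar is None else alpha * z + (1 - alpha) * z_bar
        out.append(z_bar)
    return out
```

The EMA trades an exact window mean for constant memory per stream, which matters when many generations are scored concurrently.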

Training uses a softmax-weighted binary cross-entropy loss over positions $t \ge M$:

$$w_t = \frac{\exp(\bar{z}_t/\tau)}{\sum_{t'=M}^{T} \exp(\bar{z}_{t'}/\tau)}, \quad \mathcal{L} = \sum_{t=M}^{T} w_t\,\mathrm{BCE}(y, \sigma(\bar{z}_t))$$

where $y \in \{0, 1\}$ is the exchange-level label and $\tau$ is a temperature parameter (set to 1).
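
A direct transcription of the loss (pure Python, 0-indexed positions, smoothed logits $\bar{z}_t$ supplied precomputed) might look like:

```python
import math

def softmax_weighted_bce(z_bar, y, M=16, tau=1.0):
    """Softmax-weighted BCE over smoothed logits z_bar[t] for t >= M:
    positions with the largest logits receive the most weight, so the
    most harmful-looking span of the response dominates the loss."""
    z = z_bar[M:]
    exps = [math.exp(zt / tau) for zt in z]
    total = sum(exps)
    loss = 0.0
    for zt, e in zip(z, exps):
        w = e / total                          # softmax weight w_t
        p = 1.0 / (1.0 + math.exp(-zt))        # sigma(z_bar_t)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        loss += w * bce
    return loss
```

When all logits are equal the weights are uniform and the loss reduces to ordinary BCE; at zero logits and $y = 1$ it equals $\ln 2$.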

For ensembling, probe logits $z_{\text{probe}}(E)$ are combined with external classifier logits $z_S(E)$ and $z_{XS}(E)$:

$$z_{\text{ens}}(E) = \alpha z_{\text{probe}}(E) + (1-\alpha) z_S(E), \quad p_{\text{ens}} = \sigma(z_{\text{ens}}(E))$$

with $\alpha \approx 0.5$ empirically near-optimal. Ensemble aggregation predominates in final decision-making, improving both recall and precision while reducing reliance on high-cost full LLM passes (Cunningham et al., 8 Jan 2026).
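
The ensemble step itself is a one-line convex combination of logits:

```python
import math

def ensemble_probability(z_probe, z_s, alpha=0.5):
    """p_ens = sigma(alpha * z_probe + (1 - alpha) * z_S)."""
    z_ens = alpha * z_probe + (1.0 - alpha) * z_s
    return 1.0 / (1.0 + math.exp(-z_ens))
```

With $\alpha = 0.5$, probe and external logits of equal magnitude and opposite sign cancel to $p_{\text{ens}} = 0.5$, so neither signal can unilaterally force a refusal.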

5. Computational Cost and Performance Metrics

CC++ achieves a dramatic reduction in computational burden compared to previous defenses. Table 1 summarizes key trade-offs:

System                                  Rel. Compute (%)  High-Risk Vuln. Rate (×10⁻³)  Refusal Rate (%)
Last-generation (input+output)          100.0             18.71                         0.073
Exchange-only classifier                150.0             8.85                          0.038
Two-stage exchange cascade              27.8              8.78                          0.036
Production-grade CC++ (probe+ensemble)  3.5               5.05                          0.050

In deployment, the full exchange classifier costs 150% of the prior baseline (100%), while the complete production system runs at 3.5%, making it $150/3.5 \approx 43\times$ more efficient than the full exchange classifier. FLOP analysis on Gemma 3 27B shows that probe evaluation (~377K FLOPs/token) is over $20{,}000\times$ cheaper than a small 4B-parameter external LLM (~$8 \times 10^9$ FLOPs/token). CC++ leverages cascades and sparsity to amplify these savings (Cunningham et al., 8 Jan 2026).
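
These headline ratios can be checked with back-of-the-envelope arithmetic, using only the figures quoted above:

```python
# Relative compute: full exchange classifier vs. complete production system.
full_exchange, production = 150.0, 3.5     # percent of prior baseline
efficiency_gain = full_exchange / production
assert round(efficiency_gain) == 43

# Per-token FLOPs: linear probe vs. small 4B-parameter external LLM.
probe_flops, llm_flops = 377e3, 8e9
assert llm_flops / probe_flops > 20_000    # "over 20,000x" cheaper per token
```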

6. Refusal Rate and Red-Teaming Robustness

The refusal rate in CC++ is the fraction of exchanges for which the system flags and halts a model response:

$$\mathrm{RefusalRate} = \frac{\#\,\text{exchanges refused}}{\#\,\text{total exchanges}} \times 100\%$$

In December 2025 shadow deployment, CC++ maintained a refusal rate of 0.05% over production traffic.
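
The metric is a simple ratio; the counts below are illustrative values consistent with the reported 0.05% rate, not actual traffic volumes from the deployment:

```python
def refusal_rate(refused, total):
    """Fraction of exchanges flagged and halted, as a percentage."""
    return 100.0 * refused / total

# Illustrative counts only: 50 refusals out of 100,000 exchanges -> 0.05%.
assert abs(refusal_rate(50, 100_000) - 0.05) < 1e-12
```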

Extensive human red-teaming (≈198,000 probing queries, 1,736 cumulative hours) demonstrated:

  • Discovery of a single high-risk vulnerability.
  • Vulnerability rate of 0.005 per thousand queries.
  • No attack elicited successful responses to all eight tested universal jailbreak queries—a core CC++ target.
  • Minimum time to first vulnerability was 30 hours.
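
The reported per-thousand vulnerability rate follows directly from the red-teaming counts above:

```python
queries, vulnerabilities = 198_000, 1
rate_per_thousand = vulnerabilities / (queries / 1000)   # ~0.00505
assert abs(rate_per_thousand - 0.005) < 1e-3             # matches the reported 0.005
```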

This empirical evaluation highlights the robustness of CC++ against advanced adversarial prompting (Cunningham et al., 8 Jan 2026).

7. Deployment and Operational Considerations

CC++ employs several production-oriented optimizations:

  • Streaming inference with prompt caching enables per-token classifier evaluation with minimal added latency.
  • Use of EMA eliminates sliding-window averaging’s memory overhead at inference time.
  • Batch processing of escalated exchanges and GPU off-loading for small external LLMs optimize throughput.
  • Shadow-deployment logging enables continuous recalibration of the thresholds ($\tau_1$, $\tau_2$) and real-time refusal monitoring.
  • Comprehensive end-to-end testing is enforced to prevent misconfigurations that could introduce vulnerabilities at the infrastructure level.

These strategies ensure that CC++ maintains both high availability and strong security guarantees across diverse, real-world LLM applications (Cunningham et al., 8 Jan 2026).
