Constitutional Classifiers++ (CC++)

Updated 10 January 2026
  • The paper introduces CC++ as a production-grade defense system for LLMs, achieving a 40× reduction in computational cost and a 0.05% refusal rate.
  • Exchange-based context evaluation and a two-stage classifier cascade are employed to robustly mitigate universal jailbreaks in real-time deployments.
  • Efficient linear probe classifiers combined with ensembling strategies ensure low inference overhead while maintaining high security and red-teaming resilience.

Constitutional Classifiers++ (CC++) is an end-to-end defense framework for LLMs that robustly mitigates universal jailbreaks while achieving low computational cost and refusal rates. CC++ extends previous Constitutional Classifier architectures by introducing exchange-based context evaluation, classifier cascades, and linear probes with ensembling, resulting in a system capable of production-grade deployment with a 40× reduction in computational overhead and a refusal rate of 0.05% on live production traffic (Cunningham et al., 8 Jan 2026).

1. Objectives and Conceptual Foundations

CC++ is architected to address critical deployment requirements for LLMs:

  • Resistance to universal jailbreaks, i.e., attack strategies that reliably bypass safeguards across a diverse set of target queries (five to eight or more).
  • Maintenance of an exceptionally low false-positive refusal rate when handling benign user traffic.
  • Enforcement of minimal inference overhead to ensure efficient production-scale operation.

These objectives are realized through three principal innovations: (a) exchange classifiers operating over conversational context, (b) a two-stage classifier cascade optimizing cost and precision, and (c) efficient linear probe classifiers on model activations, ensembled with autonomous external classifiers for enhanced robustness and efficiency (Cunningham et al., 8 Jan 2026).

2. Exchange Classifier Formalism

CC++ introduces the exchange classifier, which evaluates entire conversational sequences rather than single utterances. Define an exchange $E = (U, V)$, with $U = (u_1, \ldots, u_m)$ the sequence of user messages and $V = (v_1, \ldots, v_n)$ the generated model responses.

A fine-tuned LLM, such as Claude Haiku 3.5–4.5, implements the exchange classifier $f_{\text{exchange}}$ by scoring the sequence as:

$$s(E) = f_{\text{exchange}}(E) \in \mathbb{R}$$

where $s(E)$ is interpreted as a logit for the refusal probability,

$$p_{\text{refuse}}(E) = \sigma(s(E)) = \frac{1}{1 + e^{-s(E)}}$$

Let $\phi_\ell(t; E) \in \mathbb{R}^d$ denote the model activations at layer $\ell$ and token $t$ when processing $[U; V]$. The classifier can aggregate activations across probed layers:

$$\Psi(E) = [\phi_{\ell_1}(t^*; E);\, \ldots;\, \phi_{\ell_k}(t^*; E)]$$

and compute logits $s(E) = W^\top \Psi(E) + b$, where $t^*$ indexes the last or highest-scoring token. This formalism enables context-aware evaluation, addressing the limitations of previous input-only or output-only classifiers (Cunningham et al., 8 Jan 2026).
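
This linear-head formulation can be sketched in a few lines. The dimensions, random activations, and probe weights below are illustrative placeholders, not values from the paper:

```python
import math
import random

random.seed(0)
d, k = 8, 3  # hidden size per probed layer, number of layers (illustrative)

def refusal_probability(phi_layers, W, b):
    """Concatenate activations at the chosen token t* across probed layers
    (Psi(E)), apply the linear head s(E) = W . Psi(E) + b, then the logistic
    sigmoid to obtain p_refuse(E)."""
    psi = [x for layer in phi_layers for x in layer]   # Psi(E), length k * d
    s = sum(w * x for w, x in zip(W, psi)) + b         # logit s(E)
    return 1.0 / (1.0 + math.exp(-s))

# Placeholder activations and probe weights (illustrative, not from the paper).
phi_layers = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
W = [random.gauss(0, 0.1) for _ in range(k * d)]
p = refusal_probability(phi_layers, W, b=0.0)
```

The sigmoid guarantees a well-formed probability regardless of the logit's magnitude.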

3. Two-Stage Classifier Cascade

Cost efficiency is achieved by cascading two classifiers: a lightweight first-stage exchange classifier $f_1(E)$, approximately one-third the size of the full model, and a full-capacity second-stage classifier $f_2(E)$ (e.g., Haiku 4.5). Decision thresholds $\tau_1$ and $\tau_2$ are calibrated to minimize false positives on production data.

The cascade operates as follows:

$$\text{Decision}(E) = \begin{cases} \text{Accept} & f_1(E) < \tau_1 \\ \text{Accept} & f_1(E) \ge \tau_1 \wedge f_2(E) < \tau_2 \\ \text{Refuse} & f_1(E) \ge \tau_1 \wedge f_2(E) \ge \tau_2 \end{cases}$$

The first stage flags approximately 5.5% of queries for escalation, while the second stage delivers final precision. This stratified approach ensures a refusal rate of approximately 0.05% end-to-end, substantially reducing unnecessary refusals relative to single-stage or prior systems (Cunningham et al., 8 Jan 2026).
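
The decision rule maps directly onto a short function. The second stage is a full LLM classifier in practice; `suspicious_score` below is a toy stand-in for illustration only:

```python
def cascade_decision(f1_logit, f2, exchange, tau1, tau2):
    """Two-stage cascade: a cheap first-stage score clears most traffic,
    and only escalated exchanges (~5.5% in production) pay for the
    full-capacity second-stage classifier."""
    if f1_logit < tau1:
        return "accept"              # stage 1 clears the exchange outright
    if f2(exchange) < tau2:          # stage 2 runs only on escalation
        return "accept"
    return "refuse"

# Toy stand-in for the second-stage scorer (illustrative only).
suspicious_score = lambda e: 1.0 if "forbidden" in e else -1.0
```

Only exchanges that both stages flag are refused, which is what keeps the end-to-end refusal rate low.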

4. Linear Probe Classifiers and Ensemble Strategy

CC++ employs linear probe classifiers atop pre-computed transformer activations. At each generation step $t$, probe features $\psi_t = \phi_\ell(t)$ (single-layer or concatenated multi-layer activations) are projected as $z_t = W^\top \psi_t + b$. Sliding-window averaging (window $M \approx 16$ tokens) smooths the logits, later replaced at inference with an exponential moving average (EMA) to limit memory usage.
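
The two smoothing schemes can be compared in a few lines. The EMA coefficient choice $\alpha = 2/(M+1)$ is a common window-matching heuristic, not a value from the paper:

```python
from collections import deque

def sliding_mean(logits, M=16):
    """Training-time smoothing: mean of the most recent M per-token logits
    (requires storing up to M values)."""
    window, out = deque(maxlen=M), []
    for z in logits:
        window.append(z)
        out.append(sum(window) / len(window))
    return out

def ema_smooth(logits, M=16):
    """Inference-time replacement: exponential moving average with O(1)
    memory. alpha = 2 / (M + 1) is a common window-matching heuristic."""
    alpha, out, z_bar = 2.0 / (M + 1), [], None
    for z in logits:
        z_bar = z if z_bar is None else alpha * z + (1 - alpha) * z_bar
        out.append(z_bar)
    return out
```

The EMA trades an exact window mean for constant memory per stream, which matters when many generations are scored concurrently.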

Training uses a softmax-weighted binary cross-entropy loss over positions $t \ge M$:

$$w_t = \frac{\exp(\bar{z}_t/\tau)}{\sum_{t'=M}^{T} \exp(\bar{z}_{t'}/\tau)}, \quad \mathcal{L} = \sum_{t=M}^{T} w_t\,\mathrm{BCE}(y, \sigma(\bar{z}_t))$$

where $y \in \{0, 1\}$ is the exchange-level label and $\tau$ is a temperature parameter (set to 1).
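
A direct transcription of the loss (pure Python, 0-indexed positions, smoothed logits $\bar{z}_t$ supplied precomputed) might look like:

```python
import math

def softmax_weighted_bce(z_bar, y, M=16, tau=1.0):
    """Softmax-weighted BCE over smoothed logits z_bar[t] for t >= M:
    positions with the largest logits receive the most weight, so the
    most harmful-looking span of the response dominates the loss."""
    z = z_bar[M:]
    exps = [math.exp(zt / tau) for zt in z]
    total = sum(exps)
    loss = 0.0
    for zt, e in zip(z, exps):
        w = e / total                          # softmax weight w_t
        p = 1.0 / (1.0 + math.exp(-zt))        # sigma(z_bar_t)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        loss += w * bce
    return loss
```

When all logits are equal the weights are uniform and the loss reduces to ordinary BCE; at zero logits and $y = 1$ it equals $\ln 2$.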

For ensembling, probe logits $z_{\text{probe}}(E)$ are combined with external classifier logits $z_S(E)$ and $z_{XS}(E)$:

$$z_{\text{ens}}(E) = \alpha z_{\text{probe}}(E) + (1-\alpha) z_S(E), \quad p_{\text{ens}} = \sigma(z_{\text{ens}}(E))$$

with $\alpha \approx 0.5$ empirically near-optimal. Ensemble aggregation predominates in final decision-making, improving both recall and precision while reducing reliance on high-cost full LLM passes (Cunningham et al., 8 Jan 2026).
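
The ensemble step itself is a one-line convex combination of logits:

```python
import math

def ensemble_probability(z_probe, z_s, alpha=0.5):
    """p_ens = sigma(alpha * z_probe + (1 - alpha) * z_S)."""
    z_ens = alpha * z_probe + (1.0 - alpha) * z_s
    return 1.0 / (1.0 + math.exp(-z_ens))
```

With $\alpha = 0.5$, probe and external logits of equal magnitude and opposite sign cancel to $p_{\text{ens}} = 0.5$, so neither signal can unilaterally force a refusal.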

5. Computational Cost and Performance Metrics

CC++ achieves a dramatic reduction in computational burden compared to previous defenses. Table 1 summarizes key trade-offs:

System                                  Rel. Compute (%)  High-Risk Vuln. Rate (×10⁻³)  Refusal Rate (%)
Last-generation (input+output)          100.0             18.71                         0.073
Exchange-only classifier                150.0             8.85                          0.038
Two-stage exchange cascade              27.8              8.78                          0.036
Production-grade CC++ (probe+ensemble)  3.5               5.05                          0.050

In deployment, the full exchange classifier costs 150% of the prior baseline (100%), while the complete production system runs at 3.5%, making it $150/3.5 \approx 43\times$ more efficient than the full exchange classifier. FLOP analysis on Gemma 3 27B shows that probe evaluation (~377K FLOPs/token) is over $20{,}000\times$ cheaper than a small 4B-parameter external LLM (~$8 \times 10^9$ FLOPs/token). CC++ leverages cascades and sparsity to amplify these savings (Cunningham et al., 8 Jan 2026).
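
These headline ratios can be checked with back-of-the-envelope arithmetic, using only the figures quoted above:

```python
# Relative compute: full exchange classifier vs. complete production system.
full_exchange, production = 150.0, 3.5     # percent of prior baseline
efficiency_gain = full_exchange / production
assert round(efficiency_gain) == 43

# Per-token FLOPs: linear probe vs. small 4B-parameter external LLM.
probe_flops, llm_flops = 377e3, 8e9
assert llm_flops / probe_flops > 20_000    # "over 20,000x" cheaper per token
```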

6. Refusal Rate and Red-Teaming Robustness

The refusal rate in CC++ is the fraction of exchanges for which the system flags and halts a model response:

$$\mathrm{RefusalRate} = \frac{\#\,\text{exchanges refused}}{\#\,\text{total exchanges}} \times 100\%$$

In December 2025 shadow deployment, CC++ maintained a refusal rate of 0.05% over production traffic.
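
The metric is a simple ratio; the counts below are illustrative values consistent with the reported 0.05% rate, not actual traffic volumes from the deployment:

```python
def refusal_rate(refused, total):
    """Fraction of exchanges flagged and halted, as a percentage."""
    return 100.0 * refused / total

# Illustrative counts only: 50 refusals out of 100,000 exchanges -> 0.05%.
assert abs(refusal_rate(50, 100_000) - 0.05) < 1e-12
```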

Extensive human red-teaming (≈198,000 probing queries, 1,736 cumulative hours) demonstrated:

  • Discovery of a single high-risk vulnerability.
  • Vulnerability rate of 0.005 per thousand queries.
  • No attack elicited successful responses to all eight tested universal jailbreak queries—a core CC++ target.
  • Minimum time to first vulnerability was 30 hours.
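
The reported per-thousand vulnerability rate follows directly from the red-teaming counts above:

```python
queries, vulnerabilities = 198_000, 1
rate_per_thousand = vulnerabilities / (queries / 1000)   # ~0.00505
assert abs(rate_per_thousand - 0.005) < 1e-3             # matches the reported 0.005
```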

This empirical evaluation highlights the robustness of CC++ against advanced adversarial prompting (Cunningham et al., 8 Jan 2026).

7. Deployment and Operational Considerations

CC++ employs several production-oriented optimizations:

  • Streaming inference with prompt caching enables per-token classifier evaluation with minimal added latency.
  • Use of EMA eliminates sliding-window averaging’s memory overhead at inference time.
  • Batch processing of escalated exchanges and GPU off-loading for small external LLMs optimize throughput.
  • Shadow-deployment logging enables continuous recalibration of the thresholds ($\tau_1$, $\tau_2$) and real-time refusal monitoring.
  • Comprehensive end-to-end testing is enforced to prevent misconfigurations that could introduce vulnerabilities at the infrastructure level.

These strategies ensure that CC++ maintains both high availability and strong security guarantees across diverse, real-world LLM applications (Cunningham et al., 8 Jan 2026).
