Constitutional Classifiers++ (CC++)
- The paper introduces CC++ as a production-grade defense system for LLMs, achieving a 40× reduction in computational cost and a 0.05% refusal rate.
- Exchange-based context evaluation and a two-stage classifier cascade are employed to robustly mitigate universal jailbreaks in real-time deployments.
- Efficient linear probe classifiers combined with ensembling strategies ensure low inference overhead while maintaining high security and red-teaming resilience.
Constitutional Classifiers++ (CC++) is an end-to-end defense framework for LLMs that robustly mitigates universal jailbreaks while achieving low computational cost and refusal rates. CC++ extends previous Constitutional Classifier architectures by introducing exchange-based context evaluation, classifier cascades, and linear probes with ensembling, resulting in a system capable of production-grade deployment with a 40× reduction in computational overhead and a refusal rate of 0.05% on live production traffic (Cunningham et al., 8 Jan 2026).
1. Objectives and Conceptual Foundations
CC++ is architected to address critical deployment requirements for LLMs:
- Resistance to universal jailbreaks, i.e., attacks reliably bypassing safeguards across 5–8 or more diverse target queries.
- Maintenance of an exceptionally low false-positive refusal rate when handling benign user traffic.
- Enforcement of minimal inference overhead to ensure efficient production-scale operation.
These objectives are realized through three principal innovations: (a) exchange classifiers operating over conversational context, (b) a two-stage classifier cascade optimizing cost and precision, and (c) efficient linear probe classifiers on model activations, ensembled with autonomous external classifiers for enhanced robustness and efficiency (Cunningham et al., 8 Jan 2026).
2. Exchange Classifier Formalism
CC++ introduces the exchange classifier, which evaluates entire conversational sequences rather than single utterances. Define an exchange $E = (U, R)$, with $U = (u_1, \dots, u_n)$ as the sequence of user messages and $R = (r_1, \dots, r_n)$ as the generated model responses.
A fine-tuned LLM, such as Claude Haiku 3.5–4.5, implements the exchange classifier by scoring the sequence as
$$s(E) = f_\theta(u_1, r_1, \dots, u_n, r_n),$$
where $s(E)$ is interpreted as a logit for the refusal probability, $p(\text{flag} \mid E) = \sigma(s(E))$.
Let $h_{\ell,t}$ denote model activations at layer $\ell$ and token $t$ when processing $E$. The classifier can aggregate activations into per-token logits
$$z_t = w^\top h_{\ell,t} + b,$$
and compute $s(E) = z_{t^*}$, where $t^*$ indexes the last or highest-scoring token. This formalism enables context-aware evaluation, addressing limitations of previous input- or output-only classifiers (Cunningham et al., 8 Jan 2026).
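The per-token scoring step can be sketched in plain Python. This is a minimal illustration with toy activation and weight values, not the paper's trained classifier:

```python
import math

def exchange_score(activations, w, b):
    """Score an exchange from per-token activations h_{l,t}.

    activations: list of T activation vectors (each a list of d floats)
    for the tokens of the exchange. Computes per-token logits
    z_t = w . h_t + b and returns s(E) = z_{t*}, taking t* as the
    highest-scoring token (a "last token" variant would return z[-1]).
    """
    z = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in activations]
    return max(z)

def refusal_probability(s):
    """Interpret s(E) as a logit: p(flag | E) = sigmoid(s(E))."""
    return 1.0 / (1.0 + math.exp(-s))

# Toy usage: 3 tokens, 2-dimensional activations (illustrative values only).
h = [[0.2, -0.1], [1.0, 0.5], [0.3, 0.4]]
w = [1.0, 2.0]
s = exchange_score(h, w, b=0.0)   # logits: 0.0, 2.0, 1.1 -> s = 2.0
p = refusal_probability(s)
```

The max-over-tokens reduction is what makes the classifier sensitive to a harmful span anywhere in the exchange rather than only at its end.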
3. Two-Stage Classifier Cascade
Cost efficiency is achieved by cascading two classifiers: a lightweight first-stage exchange classifier of approximately one-third the size of the full model, and a full-capacity second-stage classifier (e.g., Haiku 4.5). Decision thresholds $\tau_1$ and $\tau_2$ are calibrated to minimize false positives on production data.
The cascade operates as follows: an exchange is escalated only when the first-stage score exceeds $\tau_1$, and is refused only if the second-stage score then exceeds $\tau_2$:
$$\text{flag}(E) = \mathbb{1}[s_1(E) > \tau_1] \cdot \mathbb{1}[s_2(E) > \tau_2].$$
The first stage flags approximately 5.5% of queries for escalation, while the second stage delivers final precision. This stratified approach yields an end-to-end refusal rate of approximately 0.05%, substantially reducing unnecessary refusals relative to single-stage or prior systems (Cunningham et al., 8 Jan 2026).
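The cascade decision can be sketched as follows. The scoring functions and thresholds here are hypothetical stand-ins for the calibrated production values:

```python
def cascade_flag(exchange, stage1_score, stage2_score, tau1, tau2):
    """Two-stage cascade: the cheap stage-1 classifier screens every
    exchange; only exchanges scoring above tau1 (~5.5% of traffic in
    the paper) are escalated to the expensive stage-2 classifier,
    whose threshold tau2 makes the final refusal decision."""
    if stage1_score(exchange) <= tau1:
        return False                       # passed cheap screen, no refusal
    return stage2_score(exchange) > tau2   # escalated: stage 2 decides

# Toy usage: scores are placeholder logits attached to each exchange.
benign = {"score1": 0.1, "score2": 0.2}
harmful = {"score1": 0.9, "score2": 0.95}
s1 = lambda e: e["score1"]
s2 = lambda e: e["score2"]
flag_benign = cascade_flag(benign, s1, s2, tau1=0.5, tau2=0.8)    # False
flag_harmful = cascade_flag(harmful, s1, s2, tau1=0.5, tau2=0.8)  # True
```

Because the expensive classifier runs only on the escalated minority, the expected per-exchange cost is dominated by the lightweight first stage.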
4. Linear Probe Classifiers and Ensemble Strategy
CC++ employs linear probe classifiers atop pre-computed transformer activations. At each generation step $t$, probe features $x_t$ (single-layer or concatenated multi-layer activations) are projected as $z_t = w^\top x_t + b$. Sliding-window averaging (with a window of $W$ tokens) smooths the logits; at inference this is replaced with an exponential moving average (EMA) to limit memory usage.
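The two smoothing schemes can be contrasted in a few lines of Python. Values are illustrative only:

```python
def sliding_window_smooth(z, W):
    """Mean of the most recent W logits at each step (O(W) memory)."""
    out = []
    for t in range(len(z)):
        window = z[max(0, t - W + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

def ema_smooth(z, alpha):
    """Exponential moving average: m_t = alpha*z_t + (1-alpha)*m_{t-1}.

    Needs only O(1) state per stream, which is why it replaces the
    sliding window at inference time."""
    out, m = [], z[0]
    for x in z:
        m = alpha * x + (1 - alpha) * m
        out.append(m)
    return out

z = [0.0, 0.0, 4.0, 4.0, 4.0]
win = sliding_window_smooth(z, W=2)   # [0.0, 0.0, 2.0, 4.0, 4.0]
ema = ema_smooth(z, alpha=0.5)        # [0.0, 0.0, 2.0, 3.0, 3.5]
```

Both damp single-token logit spikes; the EMA trades the window's hard cutoff for constant memory per generation stream.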
Training uses a softmax-weighted binary cross-entropy loss across positions $t$:
$$\mathcal{L} = \sum_t \alpha_t \, \mathrm{BCE}\big(\sigma(z_t), y\big), \qquad \alpha_t = \frac{\exp(z_t / \tau)}{\sum_{t'} \exp(z_{t'} / \tau)},$$
where $y \in \{0, 1\}$ is the exchange-level label and $\tau$ is a temperature parameter (set to 1).
For ensembling, probe logits $z^{\mathrm{probe}}$ are combined with external classifier logits $z^{\mathrm{ext}}$:
$$z^{\mathrm{ens}} = \lambda \, z^{\mathrm{probe}} + (1 - \lambda) \, z^{\mathrm{ext}},$$
with $\lambda$ set to an empirically near-optimal value. Ensemble aggregation predominates in final decision-making, improving both recall and precision while reducing reliance on high-cost full LLM passes (Cunningham et al., 8 Jan 2026).
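The combination rule itself is a one-line convex mix of the two logit streams. The mixing weight below is a placeholder, not the paper's tuned value:

```python
def ensemble_logit(z_probe, z_ext, lam):
    """Convex combination: z_ens = lam*z_probe + (1 - lam)*z_ext."""
    assert 0.0 <= lam <= 1.0
    return lam * z_probe + (1.0 - lam) * z_ext

# Illustrative values only.
z_ens = ensemble_logit(z_probe=1.2, z_ext=-0.4, lam=0.75)   # 0.9 - 0.1 = 0.8
```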
5. Computational Cost and Performance Metrics
CC++ achieves a dramatic reduction in computational burden compared to previous defenses. Table 1 summarizes key trade-offs:
| System | Relative Compute (%) | High-Risk Vulnerability Rate (×10⁻³) | Refusal Rate (%) |
|---|---|---|---|
| Last-Generation (input+output) | 100.0 | 18.71 | 0.073 |
| Exchange-Only Classifier | 150.0 | 8.85 | 0.038 |
| Two-Stage Exchange Cascade | 27.8 | 8.78 | 0.036 |
| Production-Grade CC++ (probe+ens) | 3.5 | 5.05 | 0.050 |
In deployment, the full exchange classifier costs 150% relative to the prior baseline at 100%, but the complete production system is far more efficient. FLOP analysis on Gemma 3 27B shows that probe evaluation (~$377$K FLOPs/token) is orders of magnitude cheaper than running even a small 4B-parameter external LLM (on the order of $10^9$–$10^{10}$ FLOPs/token for a dense forward pass). CC++ leverages cascades and sparsity to amplify these savings (Cunningham et al., 8 Jan 2026).
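The gap can be checked with back-of-the-envelope arithmetic. The external-LLM figure assumes the standard ~2N FLOPs-per-token estimate for a dense forward pass; it is an approximation, not a measured number from the paper:

```python
probe_flops = 377e3              # probe cost, FLOPs per token (from the paper)
ext_params = 4e9                 # parameters of the small external LLM
ext_flops = 2 * ext_params       # ~2N FLOPs/token for a dense forward pass
ratio = ext_flops / probe_flops  # ~2.1e4: over four orders of magnitude apart
```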
6. Refusal Rate and Red-Teaming Robustness
The refusal rate in CC++ is the fraction of exchanges for which the system flags and halts a model response:
$$\text{refusal rate} = \frac{\#\,\text{flagged exchanges}}{\#\,\text{total exchanges}}.$$
In a December 2025 shadow deployment, CC++ maintained a refusal rate of 0.05% over production traffic.
Extensive human red-teaming (≈198,000 probing queries, 1,736 cumulative hours) demonstrated:
- Discovery of a single high-risk vulnerability.
- Vulnerability rate of 0.005 per thousand queries.
- No attack elicited successful responses to all eight tested universal jailbreak queries—a core CC++ target.
- Minimum time to first vulnerability was 30 hours.
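The reported vulnerability rate is consistent with the raw counts above; a quick sanity check, using only figures from the text:

```python
queries = 198_000                  # probing queries from human red-teaming
vulns = 1                          # high-risk vulnerabilities discovered
hours = 1736                       # cumulative red-teaming hours
rate_per_thousand = vulns / queries * 1000   # ~0.00505, reported as 0.005
hours_per_vuln = hours / vulns               # 1736 hours per discovered issue
```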
This empirical evaluation highlights the robustness of CC++ against advanced adversarial prompting (Cunningham et al., 8 Jan 2026).
7. Deployment and Operational Considerations
CC++ employs several production-oriented optimizations:
- Streaming inference with prompt-caching allows per-token classifier evaluation with minimal added latency.
- Use of EMA eliminates sliding-window averaging’s memory overhead at inference time.
- Batch processing of escalated exchanges and GPU off-loading for small external LLMs optimize throughput.
- Shadow-deployment logging enables continuous recalibration of thresholds ($\tau_1$, $\tau_2$) and real-time refusal monitoring.
- Comprehensive end-to-end testing is enforced to prevent misconfigurations that could introduce vulnerabilities at the infrastructure level.
These strategies ensure that CC++ maintains both high availability and strong security guarantees across diverse, real-world LLM applications (Cunningham et al., 8 Jan 2026).