- The paper presents a dual-encoder architecture leveraging Asymmetric Contrastive Learning (ACL) to cluster decomposed malicious fragments from untraceable LLM traffic.
- It achieves a malicious intent recall above 0.76 with a false positive rate below 0.2%, outperforming traditional stateless and session-level defenses.
- Extensive evaluation on 3.62M instructions confirms TwinGate’s scalability, low latency (<300ms P99), and robust performance against adaptive and white-box attacks.
TwinGate: Stateful Defense Against Decompositional Jailbreaks in Untraceable LLM Traffic
Introduction
Decompositional jailbreaks are a sophisticated and increasingly prevalent threat to LLMs: adversaries split a malicious query into semantically disjoint, individually benign prompts. In practical deployments, where request streams are continuous, fully anonymized, and highly interleaved, stateless defenses (such as single-turn guardrails and RLHF-aligned models) fail to detect and block the aggregated malicious intent. The paper "TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning" (2604.27861) introduces TwinGate, a dual-encoder system engineered for scalable stateful defense under untraceable traffic. It uses Asymmetric Contrastive Learning (ACL) to cluster intent fragments and suppresses false positives via decision inheritance.
Problem Statement and Threat Model
The decompositional jailbreak paradigm exploits the structural limitations of conventional defenses by scattering fragments of a prohibited intent across multiple queries, sessions, and user identities. Under the real-world scenario of fully anonymized and interleaved traffic, defenders lack access to trustworthy metadata, rendering session-level monitoring and attribution infeasible. The detection challenge is thus reduced to semantic analysis: fragments typically have negligible embedding similarity, and maliciousness only manifests in aggregate.
TwinGate formalizes the defense as learning a stateful function $F_{\mathrm{def}}(q_t, H_{t-1})$ over the global request history $H_{t-1}$, seeking high recall in intercepting cumulative malicious intents while maintaining a strict bound on False Positive Rate (FPR) for benign queries.
TwinGate Architecture
TwinGate comprises four principal modules: (i) a frozen encoder for semantic vectors, (ii) an ACL-finetuned encoder for intent vectors, (iii) dual in-memory vector repositories storing both representations, and (iv) a decision module executing fast-path inheritance (via semantic equivalence) and intent-based clustering (via ACL). This enables robust, scalable stateful retrieval with computational efficiency suitable for production-grade LLM APIs.
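The decision flow described above can be illustrated with a minimal sketch. It is not the paper's implementation: the class name `TwinStoreGate`, the thresholds `tau_sem` and `tau_intent`, the cosine metric, and the rule "block when the nearest stored intent vector is close enough" are all assumptions made for illustration. The encoders are left out (vectors are passed in directly), and the stores are updated synchronously, whereas the paper describes asynchronous updates.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TwinStoreGate:
    """Toy stateful gate with two in-memory vector stores (hypothetical names)."""

    def __init__(self, tau_sem=0.95, tau_intent=0.8):
        self.tau_sem = tau_sem        # assumed threshold for semantic equivalence
        self.tau_intent = tau_intent  # assumed threshold for intent clustering
        self.semantic = []            # (semantic_vector, verdict) pairs
        self.intent = []              # intent vectors of all past queries

    def decide(self, sem_vec, intent_vec):
        # Fast path: a near-duplicate of an earlier query inherits its verdict.
        best = max(self.semantic, key=lambda e: cosine(sem_vec, e[0]), default=None)
        if best is not None and cosine(sem_vec, best[0]) >= self.tau_sem:
            verdict = best[1]
        else:
            # Slow path: k=1 nearest neighbour in the intent store.
            nearest = max((cosine(intent_vec, v) for v in self.intent), default=-1.0)
            verdict = "block" if nearest >= self.tau_intent else "allow"
        # Record the query in both stores so later requests can see it.
        self.semantic.append((list(sem_vec), verdict))
        self.intent.append(list(intent_vec))
        return verdict


gate = TwinStoreGate()
print(gate.decide([1.0, 0.0], [1.0, 0.0]))  # first sighting, nothing to match
print(gate.decide([1.0, 0.0], [1.0, 0.0]))  # exact repeat, takes the fast path
print(gate.decide([0.0, 1.0], [0.9, 0.1]))  # different wording, nearby intent
```

In this toy setup the repeated query would be blocked by the intent path alone, because its own earlier intent vector is its nearest neighbour. The fast path exists to keep such repeats from counting against the gate.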
Figure 1: TwinGate's end-to-end workflow combines dual encoding, stateful vector querying, decision inheritance, ACL clustering, and asynchronous vector database updates.
Asymmetric Contrastive Learning and Decision Inheritance
The ACL objective forces fragments from the same malicious intent, regardless of splitter artifacts or generation style, to coalesce tightly in latent space, while benign queries function only as repulsive negatives. This geometric separation is key to bridging the semantic gap, allowing retrieval-based detection of distributed fragments. However, aggressive clustering increases FPR risk, especially for repeated benign traffic. TwinGate resolves this via semantic equivalence inheritance using the frozen encoder: any query highly similar to a prior safe query bypasses ACL evaluation and inherits its verdict, stabilizing operational precision.
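To make the asymmetry concrete, here is a minimal sketch of one loss with that shape. It is a supervised-contrastive (InfoNCE-style) loss in which only malicious fragments act as anchors. The exact loss in the paper may differ: the function name, the `temperature` parameter, and the use of `None` to mark benign queries are assumptions of this sketch.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def asymmetric_contrastive_loss(vectors, intent_ids, temperature=0.1):
    """Mean contrastive loss over malicious anchors only.

    vectors:    list of embeddings (non-zero lists of floats).
    intent_ids: intent label per row for malicious fragments, None for benign.
    """
    z = [normalize(v) for v in vectors]
    losses = []
    for i, anchor_id in enumerate(intent_ids):
        if anchor_id is None:
            continue  # asymmetry: benign rows are never anchors, only negatives
        positives = [j for j, other in enumerate(intent_ids)
                     if j != i and other == anchor_id]
        if not positives:
            continue  # an anchor with no sibling fragment contributes nothing
        # Scaled similarity of the anchor to every other row in the batch.
        logits = {j: dot(z[i], z[j]) / temperature
                  for j in range(len(z)) if j != i}
        log_denom = math.log(sum(math.exp(l) for l in logits.values()))
        # Pull same-intent fragments together, push everything else away.
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


ids = ["intent_a", "intent_a", None]
clustered = asymmetric_contrastive_loss([[1, 0], [1, 0.01], [0, 1]], ids)
collided = asymmetric_contrastive_loss([[1, 0], [0, 1], [0, 1]], ids)
print(clustered < collided)
```

Because benign rows never appear as anchors, nothing in this objective pulls benign queries toward one another. They only appear in the denominator, where they push malicious fragments away.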
Figure 2: Semantic pruning during ACL training eliminates latent collisions, optimizing the Recall-FPR curve.
Large-Scale Empirical Benchmark
The paper introduces a comprehensive dataset with 3.62M instructions: 603k independent benign queries, 250k benign decompositions, and 8,600+ malicious intents fragmented by multiple splitter models to simulate distributed attacks. Partitioning occurs at the intent level, ensuring robust evaluation. The diversity, scale, and strict causal evaluation protocol provide substantial coverage for stress-testing stateful defenses.
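Intent-level partitioning means every fragment of a given intent lands in the same split, so no test intent is seen in training. One common way to get this, sketched below, is to hash the intent identifier. The helper names, the 20% test fraction, and the `(intent_id, text)` record layout are assumptions, not details from the paper.

```python
import hashlib


def split_of(intent_id, test_fraction=0.2):
    """Deterministically map an intent id to 'train' or 'test'."""
    digest = hashlib.sha256(intent_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def partition(fragments, test_fraction=0.2):
    """fragments: iterable of (intent_id, text) pairs."""
    splits = {"train": [], "test": []}
    for intent_id, text in fragments:
        splits[split_of(intent_id, test_fraction)].append((intent_id, text))
    return splits


data = [(f"intent-{i}", f"fragment {k}") for i in range(50) for k in range(3)]
splits = partition(data)
train_ids = {i for i, _ in splits["train"]}
test_ids = {i for i, _ in splits["test"]}
print(train_ids.isdisjoint(test_ids))
```

Hashing the identifier rather than shuffling rows keeps the assignment stable when new fragments of an existing intent are added later.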
Security Effectiveness: Main Results
TwinGate's security evaluation centers on the Recall-FPR trade-off. Compared to stateless guardrails (Llama-Guard-3-8B, Intent-FT) and session-level monitors (Window Monitor), TwinGate achieves significantly higher recall at comparable or lower FPR:
- Malicious Intent Recall: >0.76
- False Positive Rate: $< 2 \times 10^{-3}$
Notably, Window Monitor incurs FPRs an order of magnitude higher, making it unsuitable for deployment, while the recall of stateless defenses saturates far below TwinGate's operating point. TwinGate's capacity to disrupt decompositional attacks holds even under strictly causal test streams of unseen fragments.
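For readers who want to reproduce the two headline metrics on their own logs, the following is a minimal sketch. It assumes that an intent counts as intercepted when at least one of its fragments is blocked. That definition is an assumption of this sketch, and the paper's exact criterion may differ. The function names and input layouts are illustrative.

```python
def intent_recall(fragment_verdicts):
    """fragment_verdicts: dict mapping intent_id -> list of bools,
    True where that fragment was blocked.

    Assumption: an intent is intercepted if any of its fragments is blocked.
    """
    if not fragment_verdicts:
        return 0.0
    caught = sum(1 for verdicts in fragment_verdicts.values() if any(verdicts))
    return caught / len(fragment_verdicts)


def false_positive_rate(benign_verdicts):
    """benign_verdicts: list of bools, True where a benign query was blocked."""
    if not benign_verdicts:
        return 0.0
    return sum(benign_verdicts) / len(benign_verdicts)


malicious = {
    "a": [False, True],
    "b": [False, False],
    "c": [True],
    "d": [False, False, True],
}
benign = [True] + [False] * 999
print(intent_recall(malicious))
print(false_positive_rate(benign))
```

Measuring recall per intent and FPR per query mirrors the asymmetry of the problem: an attack succeeds or fails as a whole, while each wrongly blocked benign request is a separate cost.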
Figure 3: TwinGate's Recall-FPR curve dominates three baselines, achieving superior sensitivity with minimal false positives.
Throughput, Latency, and Scalability
TwinGate's hardware-aware pipeline utilizes four A100 GPUs via NVLink, parallelizing encoding and maintaining stateful databases in HBM. System profiling under aggressive QPS reveals negligible latency overhead (P99 latency < 300 ms) and throughput exceeding 1700 QPS, vastly outperforming generative-model-based baselines (Intent-FT, Window Monitor).
Figure 4: TwinGate's throughput and latency remain stable under load, demonstrating production feasibility.
The vector database scales efficiently up to 6M vectors per GPU, with sub-linear latency growth; beyond this, hardware exhaustion becomes the limiting factor. Clustering-based LRU policies sustain defense efficacy even under aggressive capacity reduction and slow-loris attack simulations.
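The summary does not spell out the eviction policy, so the following is only one plausible reading of "clustering-based LRU": recency is tracked per cluster rather than per vector, and the least recently touched cluster is evicted as a unit. The class `ClusterLRUStore`, its methods, and the idea that cluster ids are supplied by the caller are all hypothetical.

```python
from collections import OrderedDict


class ClusterLRUStore:
    """Capacity-bounded vector store that evicts whole clusters, oldest first."""

    def __init__(self, max_vectors):
        self.max_vectors = max_vectors
        self.clusters = OrderedDict()  # cluster_id -> vectors, least recent first
        self.size = 0

    def touch(self, cluster_id):
        """Mark a cluster as recently used (e.g. after a retrieval hit)."""
        if cluster_id in self.clusters:
            self.clusters.move_to_end(cluster_id)

    def add(self, cluster_id, vector):
        self.clusters.setdefault(cluster_id, []).append(vector)
        self.clusters.move_to_end(cluster_id)
        self.size += 1
        # Evict least recently used clusters, but never the one just written,
        # so a single oversized cluster can temporarily exceed the budget.
        while self.size > self.max_vectors and len(self.clusters) > 1:
            _, evicted = self.clusters.popitem(last=False)
            self.size -= len(evicted)


store = ClusterLRUStore(max_vectors=3)
store.add("a", [1.0])
store.add("b", [2.0])
store.add("a", [3.0])   # refreshes cluster "a"
store.add("c", [4.0])   # over budget: "b" is now the least recent cluster
print(list(store.clusters))
```

Under this reading, a slowly trickled attack keeps its early fragments alive as long as later fragments keep touching the same cluster. That is one way such a policy could withstand slow-loris-style pacing.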
Figure 5: TwinGate maintains high defense efficacy down to 25% database retention, bounding memory costs and privacy risks.
Robustness Against Adaptive/White-Box Attacks
Under a pessimistic white-box threat model (full knowledge of TwinGate, ACL encoder, thresholds, and k=1 neighbor logic), adaptive evasion via rewriting or greedy coordinate gradient (GCG) optimization yields only marginal improvements in Attack Success Rate (ASR), peaking at 0.18. Semantic pollution attacks (intent-based Denial-of-Service) fail to inflate TwinGate's FPR, which remains stably low irrespective of injection scale.
Figure 6: ASR remains low for adaptive strategies, demonstrating TwinGate's resilience to evasion attempts.
Figure 7: FPR is unaffected by increasing GCG-poisoned samples, validating TwinGate's resistance to semantic DoS attacks.
Ablation Study
Isolating architectural and algorithmic components reveals the necessity of both the dual-encoder design and ACL training. Removing the frozen encoder or ACL results in severe performance collapses; symmetric CL or omitting the raw monolithic intent anchor yields weak boundaries and poor generalization.
TwinGate also generalizes robustly to unseen decomposer models, indicating that ACL learns fundamental semantic trajectories rather than overfitting to generator artifacts.
Figure 8: TwinGate outperforms all ablated variants, including when generalizing to unseen decomposer models.
Theoretical Boundaries and Attack Complexity
The paper provides geometric bounds on the maximum number of evasive fragments (spherical packing) and proves computational intractability for discrete token-level evasion as constraint complexity increases. These theoretical results support empirical findings: adaptive agents face exponential search costs and practical decomposition limits.
Implications and Future Developments
TwinGate shifts the paradigm for LLM safety from stateless, generative-model-reliant filters to stateful, scalable vector retrieval, bridging semantic gaps for distributed, metadata-free attacks. Practically, this enables robust defense for real-world, high-throughput APIs with strong precision, throughput, and resilience to adversarial adaptation. The large-scale benchmark advances reproducible evaluation protocols.
Theoretically, ACL's asymmetric clustering of intent space opens avenues for spanning more extreme semantic divergences, and for deploying defense policies tailored to hardware and privacy constraints (e.g., dynamic vector retention, rapid adaptation to novel attack schemas).
Further research should investigate integration of temporal dynamics, hierarchical clustering of intent, the incorporation of richer multimodal signals, and adversarial co-evolution strategies. Scaling TwinGate-like architectures across distributed inference boundaries, federated deployment, and multi-agent environments represents a critical direction for securing advanced LLM systems under adversarial pressure.
Conclusion
TwinGate provides a scalable, production-viable defense mechanism for decompositional jailbreaks in LLMs operating under untraceable traffic. Its ACL-powered dual-encoder design achieves stateful malicious intent recall above 0.76 at an FPR below 0.2%, while sustaining high throughput and low latency. The system's robustness against adaptive and pollution attacks, together with its strong generalization and theoretical grounding, makes it a substantial advance in LLM defense. The accompanying benchmark dataset facilitates rigorous, reproducible evaluation for future stateful safety research.