TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Published 30 Apr 2026 in cs.CR, cs.CL, and cs.LG | (2604.27861v1)

Abstract: Decompositional jailbreaks pose a critical threat to LLMs by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.

Summary

  • The paper presents a dual-encoder architecture leveraging Asymmetric Contrastive Learning (ACL) to cluster decomposed malicious fragments from untraceable LLM traffic.
  • It achieves a malicious intent recall above 0.76 with a false positive rate below 0.2%, outperforming traditional stateless and session-level defenses.
  • Extensive evaluation on 3.62M instructions confirms TwinGate’s scalability, low latency (<300ms P99), and robust performance against adaptive and white-box attacks.

TwinGate: Stateful Defense Against Decompositional Jailbreaks in Untraceable LLM Traffic

Introduction

Decompositional jailbreaks are a sophisticated and increasingly prevalent threat to LLMs: adversaries split a malicious query into semantically disjoint, individually benign prompts. In practical deployments, where request streams are continuous, fully anonymized, and highly interleaved, stateless defenses such as single-turn guardrails and RLHF-aligned models fail to detect and block the aggregated malicious intent. The paper "TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning" (2604.27861) introduces TwinGate, a dual-encoder system built for scalable stateful defense under untraceable traffic. It uses Asymmetric Contrastive Learning (ACL) to cluster intent fragments and suppresses false positives via decision inheritance.

Problem Statement and Threat Model

The decompositional jailbreak paradigm exploits the structural limitations of conventional defenses by scattering fragments of a prohibited intent across multiple queries, sessions, and user identities. Under the real-world scenario of fully anonymized and interleaved traffic, defenders lack access to trustworthy metadata, rendering session-level monitoring and attribution infeasible. The detection challenge is thus reduced to semantic analysis: fragments typically have negligible embedding similarity, and maliciousness only manifests in aggregate.

TwinGate formalizes the defense as learning a stateful function $\mathcal{F}_{def}(q_t, \mathcal{H}_{t-1})$ of the current query $q_t$ and the global request history $\mathcal{H}_{t-1}$, seeking high recall in intercepting cumulative malicious intents while maintaining a strict bound on the False Positive Rate (FPR) for benign queries.
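To make the signature concrete, here is a minimal sketch of a causal evaluation loop: each query is judged against the history that precedes it, and only then appended. The names `Defense`, `run_stream`, and `toy_defense` are hypothetical, the keyword "recipe" is a made-up stand-in for a real detector, and appending every query (blocked or not) to the history is an assumption rather than something the summary specifies.

```python
from typing import Callable, List, Sequence

# A defense maps (current query q_t, history H_{t-1}) to a verdict: True = block.
Defense = Callable[[str, Sequence[str]], bool]


def run_stream(queries: Sequence[str], defense: Defense) -> List[bool]:
    """Evaluate a defense causally: q_t is judged against H_{t-1} only."""
    history: List[str] = []
    verdicts: List[bool] = []
    for q in queries:
        verdicts.append(defense(q, history))  # the decision sees only the past
        history.append(q)                     # assumed update: H_t = H_{t-1} + [q_t]
    return verdicts


def toy_defense(q: str, history: Sequence[str]) -> bool:
    """Hypothetical stand-in detector: block once the stream has jointly
    mentioned every word of a made-up 'recipe'. No single query does so alone."""
    recipe = {"alpha", "beta", "gamma"}
    seen = set()
    for text in list(history) + [q]:
        seen.update(w for w in text.lower().split() if w in recipe)
    return seen == recipe


stream = ["alpha please", "weather today", "beta info", "gamma now"]
verdicts = run_stream(stream, toy_defense)
```

In this toy stream only the last query is blocked, which mirrors the point that maliciousness appears only in aggregate.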

TwinGate Architecture

TwinGate comprises four principal modules: (i) a frozen encoder for semantic vectors, (ii) an ACL-finetuned encoder for intent vectors, (iii) dual in-memory vector repositories storing both representations, and (iv) a decision module executing fast-path inheritance (via semantic equivalence) and intent-based clustering (via ACL). This design enables robust, scalable stateful retrieval with computational efficiency suitable for production-grade LLM APIs.

Figure 1: TwinGate's end-to-end workflow combines dual encoding, stateful vector querying, decision inheritance, ACL clustering, and asynchronous vector database updates.
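The interplay of the two vector stores and the decision module can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the class name `TwoStoreGate`, the thresholds `tau_sem` and `tau_int`, and the exact blocking rule (block when the nearest intent-space neighbour exceeds a similarity threshold) are all hypothetical, and the encoders are replaced by precomputed vectors passed in by the caller.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class TwoStoreGate:
    tau_sem: float = 0.95  # assumed threshold for "semantically equivalent"
    tau_int: float = 0.80  # assumed threshold for "same intent cluster"
    sem_store: List[Tuple[Vector, bool]] = field(default_factory=list)  # (vector, was_blocked)
    int_store: List[Vector] = field(default_factory=list)

    def _remember(self, sem_vec: Vector, int_vec: Vector, blocked: bool) -> None:
        self.sem_store.append((sem_vec, blocked))
        self.int_store.append(int_vec)

    def decide(self, sem_vec: Vector, int_vec: Vector) -> bool:
        """Return True to block. Vectors stand in for the two encoders' outputs."""
        # Fast path: a near-duplicate of a previously allowed query inherits "allow".
        for stored, was_blocked in self.sem_store:
            if not was_blocked and cosine(sem_vec, stored) >= self.tau_sem:
                self._remember(sem_vec, int_vec, False)
                return False
        # Slow path: nearest neighbour (k=1) in intent space.
        best = max((cosine(int_vec, v) for v in self.int_store), default=-1.0)
        blocked = best >= self.tau_int
        self._remember(sem_vec, int_vec, blocked)
        return blocked


gate = TwoStoreGate()
first = gate.decide([1.0, 0.0], [1.0, 0.0])     # empty stores: allowed
repeat = gate.decide([1.0, 0.0], [1.0, 0.0])    # same text again: inherits "allow"
fragment = gate.decide([0.0, 1.0], [0.9, 0.1])  # different wording, close in intent space
```

Here the repeated query is allowed through the fast path even though its intent vector matches a stored one, while the third query, semantically unrelated but close in intent space, is blocked.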

Asymmetric Contrastive Learning and Decision Inheritance

The ACL objective forces fragments from the same malicious intent, regardless of splitter artifacts or generation style, to coalesce tightly in latent space, while benign queries function only as repulsive negatives. This geometric separation is key to bridging the semantic gap, allowing retrieval-based detection of distributed fragments. However, aggressive clustering raises the risk of false positives, especially for repeated benign traffic. TwinGate resolves this via semantic equivalence inheritance using the frozen encoder: any query highly similar to a prior safe query bypasses ACL evaluation and inherits its verdict, stabilizing operational precision.

Figure 2: Semantic pruning during ACL training eliminates latent collisions, optimizing the Recall-FPR curve.
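One way to read the asymmetry is as a supervised contrastive loss in which only malicious fragments serve as anchors: same-intent fragments are pulled together, while benign samples appear only in the denominator as negatives. The sketch below follows that reading. The function name, the temperature value, and the loss form itself are assumptions, since the summary does not give the paper's exact objective.

```python
import math
from typing import List, Optional, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]


def asymmetric_contrastive_loss(
    embeddings: Sequence[Sequence[float]],
    intent_ids: Sequence[Optional[int]],  # None marks a benign sample
    temperature: float = 0.1,             # assumed value
) -> float:
    """Average loss over malicious anchors that have at least one positive."""
    z = [_normalize(v) for v in embeddings]
    total, n_anchors = 0.0, 0
    for i, intent in enumerate(intent_ids):
        if intent is None:
            continue  # asymmetry: benign samples never act as anchors
        positives = [j for j, other in enumerate(intent_ids)
                     if j != i and other == intent]
        if not positives:
            continue
        # Similarities to every other sample; benign ones act only as negatives.
        logits = {j: _dot(z[i], z[j]) / temperature
                  for j in range(len(z)) if j != i}
        m = max(logits.values())  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


points = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
tight = asymmetric_contrastive_loss(points, [7, 7, None, None])  # positives are neighbours
loose = asymmetric_contrastive_loss(points, [7, None, 7, None])  # positives are orthogonal
```

With the same four points, labelling the two nearby vectors as one intent gives a much smaller loss than labelling two orthogonal vectors as one intent, which is the pull the objective exerts during training.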

Large-Scale Empirical Benchmark

The paper introduces a dataset of 3.62M instructions: 603k independent benign queries, 250k benign decompositions, and 8,600+ malicious intents fragmented by multiple splitter models to simulate distributed attacks. Partitioning is done at the intent level, so evaluation measures generalization to intents rather than to previously seen fragments. The diversity, scale, and strictly causal evaluation protocol provide substantial coverage for stress-testing stateful defenses.

Security Effectiveness: Main Results

TwinGate's security evaluation centers on the Recall-FPR trade-off. Compared to stateless guardrails (Llama-Guard-3-8B, Intent-FT) and session-level monitors (Window Monitor), TwinGate achieves significantly higher recall at comparable or lower FPR:

  • Malicious Intent Recall: $> 0.76$
  • False Positive Rate: $< 2 \times 10^{-3}$
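For reference, the two headline metrics can be computed from per-query verdicts roughly as below. The function names are hypothetical, and counting an intent as intercepted when at least one of its fragments is blocked is an assumption; the paper may define intent-level recall differently.

```python
from typing import Dict, List, Sequence


def false_positive_rate(benign_verdicts: Sequence[bool]) -> float:
    """Fraction of benign queries that were blocked (True = blocked)."""
    if not benign_verdicts:
        return 0.0
    return sum(benign_verdicts) / len(benign_verdicts)


def intent_recall(verdicts_by_intent: Dict[str, List[bool]]) -> float:
    """Fraction of malicious intents with at least one blocked fragment
    (assumed definition of 'intercepted')."""
    if not verdicts_by_intent:
        return 0.0
    caught = sum(1 for v in verdicts_by_intent.values() if any(v))
    return caught / len(verdicts_by_intent)


# Made-up verdicts, purely to exercise the functions.
benign = [False] * 998 + [True] * 2
malicious = {
    "intent_a": [False, True],
    "intent_b": [False, False],
    "intent_c": [True, True],
    "intent_d": [False, False, True],
}
fpr = false_positive_rate(benign)
recall = intent_recall(malicious)
```

Two blocked queries out of 1,000 benign ones correspond to an FPR of $2 \times 10^{-3}$, i.e. 0.2%, the bound quoted above.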

Notably, Window Monitor incurs FPRs an order of magnitude higher, which makes it unsuitable for deployment, while the recall of stateless defenses saturates far below TwinGate's operating point. TwinGate's capacity to disrupt decompositional attacks holds even under strictly causal, unseen-fragment test streams.

Figure 3: TwinGate's Recall-FPR curve dominates three baselines, achieving superior sensitivity with minimal false positives.

Throughput, Latency, and Scalability

TwinGate's hardware-aware pipeline uses four A100 GPUs connected via NVLink, parallelizing encoding and maintaining stateful databases in HBM. System profiling under aggressive QPS reveals negligible latency overhead ($P_{99}$ latency $< 300$ ms) and throughput exceeding 1700 QPS, vastly outperforming generative-model-based baselines (Intent-FT, Window Monitor).

Figure 4: TwinGate's throughput and latency remain stable under load, demonstrating production feasibility.

The vector database scales efficiently up to 6M vectors per GPU, with sub-linear latency growth; beyond this, hardware exhaustion becomes the limiting factor. Clustering-based LRU policies sustain defense efficacy even under aggressive capacity reduction and slow-loris attack simulations.

Figure 5: TwinGate maintains high defense efficacy down to 25% database retention, bounding memory costs and privacy risks.
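The retention idea can be illustrated with a plain least-recently-used store in which a retrieval hit refreshes an entry's recency. This sketch does not reproduce the clustering component of the paper's policy; the class name `LRUVectorStore` and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import Hashable, List, Sequence


class LRUVectorStore:
    """Bounded store that evicts the least recently used vector first."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Sequence[float]]" = OrderedDict()

    def add(self, key: Hashable, vector: Sequence[float]) -> None:
        """Insert or overwrite a vector and mark it as most recently used."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = vector
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def touch(self, key: Hashable) -> bool:
        """Refresh recency, e.g. when this entry was a query's nearest neighbour."""
        if key in self._items:
            self._items.move_to_end(key)
            return True
        return False

    def keys(self) -> List[Hashable]:
        """Keys ordered from least to most recently used."""
        return list(self._items)


store = LRUVectorStore(capacity=2)
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.touch("a")             # "a" was just matched, so "b" is now the oldest
store.add("c", [0.5, 0.5])   # exceeds capacity and evicts "b"
remaining = store.keys()
```

Entries that keep matching incoming fragments stay resident, while vectors that are never retrieved age out first, which bounds memory without discarding active clusters.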

Robustness Against Adaptive/White-Box Attacks

Under a pessimistic white-box threat model (full knowledge of TwinGate, the ACL encoder, thresholds, and the $k=1$ neighbor logic), adaptive evasion via rewriting or greedy coordinate gradient (GCG) optimization yields only marginal improvements in Attack Success Rate (ASR), peaking at 0.18. Semantic pollution attacks (intent-based Denial-of-Service) fail to inflate TwinGate's FPR, which remains stably low irrespective of injection scale.

Figure 6: ASR remains low for adaptive strategies, demonstrating TwinGate's resilience to evasion attempts.

Figure 7: FPR is unaffected by increasing GCG-poisoned samples, validating TwinGate's resistance to semantic DoS attacks.

Ablation Study

Isolating architectural and algorithmic components reveals the necessity of both the dual-encoder design and ACL training. Removing either the frozen encoder or ACL causes performance to collapse; using symmetric contrastive learning or omitting the raw monolithic intent anchor yields weak boundaries and poor generalization. TwinGate also generalizes robustly to unseen decomposer models, indicating that ACL learns fundamental semantic trajectories rather than overfitting to generator artifacts.

Figure 8: TwinGate outperforms all ablated variants, including unseen decomposer generalization.

Theoretical Boundaries and Attack Complexity

The paper provides geometric bounds on the maximum number of evasive fragments (spherical packing) and proves computational intractability for discrete token-level evasion as constraint complexity increases. These theoretical results support empirical findings: adaptive agents face exponential search costs and practical decomposition limits.
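As a rough illustration of why a packing argument bounds the number of evasive fragments, consider the standard volume bound below. It is a textbook estimate under the stated assumptions (unit-norm intent vectors in $\mathbb{R}^d$, a similarity threshold $\tau < 1$, nearest-neighbour detection), not the paper's own derivation, and the paper's bound may well be tighter.

```latex
% Assumption: fragments z_1, ..., z_N are unit vectors in R^d, and a fragment
% evades nearest-neighbour detection only if its cosine similarity to every
% other stored fragment stays below the threshold tau.
\begin{align*}
  \langle z_i, z_j \rangle < \tau
    \;&\Longrightarrow\;
    \lVert z_i - z_j \rVert^2 = 2 - 2\langle z_i, z_j \rangle > 2 - 2\tau =: \varepsilon^2 . \\
  % Balls of radius eps/2 around the z_i are disjoint and lie inside the
  % ball of radius 1 + eps/2 centred at the origin; compare volumes.
  N \left(\tfrac{\varepsilon}{2}\right)^{d} \le \left(1 + \tfrac{\varepsilon}{2}\right)^{d}
    \;&\Longrightarrow\;
    N \le \left(1 + \frac{2}{\varepsilon}\right)^{d},
    \qquad \varepsilon = \sqrt{2 - 2\tau} .
\end{align*}
```

The bound is finite for any $\tau < 1$, so an attacker cannot split an intent into arbitrarily many mutually dissimilar fragments within a fixed-dimensional intent space.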

Implications and Future Developments

TwinGate shifts the paradigm for LLM safety from stateless, generative-model-reliant filters to stateful, scalable vector retrieval, bridging semantic gaps for distributed, metadata-free attacks. Practically, this enables robust defense for real-world, high-throughput APIs with strong precision, throughput, and resilience to adversarial adaptation. The large-scale benchmark advances reproducible evaluation protocols. Theoretically, ACL's asymmetric clustering of intent space opens avenues for spanning more extreme semantic divergences, and for deploying defense policies tailored to hardware and privacy constraints (e.g., dynamic vector retention, rapid adaptation to novel attack schemas).

Further research should investigate the integration of temporal dynamics, hierarchical clustering of intent, the incorporation of richer multimodal signals, and adversarial co-evolution strategies. Scaling TwinGate-like architectures across distributed inference boundaries, federated deployments, and multi-agent environments is a critical direction for securing advanced LLM systems under adversarial pressure.

Conclusion

TwinGate provides a scalable, production-viable defense mechanism for decompositional jailbreaks in LLMs operating under untraceable traffic. Its ACL-powered dual-encoder design achieves stateful malicious intent recall above 0.76 at an FPR below 0.2%, while sustaining high throughput and low latency. The system's robustness against adaptive and pollution attacks, together with its strong generalization and theoretical grounding, makes it a substantial advance in LLM defense. The accompanying benchmark dataset facilitates rigorous, reproducible evaluation for future stateful safety research.
