The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Published 13 Apr 2026 in cs.CR, cs.AI, cs.CL, cs.CV, and cs.LG | (2604.11309v1)

Abstract: LLMs face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect the actual impact in real-world scenarios: (a) As models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) Successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose \textit{Salami Slicing Risk}, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively accumulate harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90\% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also proposed a defense strategy to constrain the Salami Attack by at least 44.8\% while achieving a maximum blocking rate of 64.8\% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper formalizes and empirically validates the Salami Slicing risk, demonstrating how benign multi-turn prompts can aggregate harmful outputs without triggering single-turn defenses.
It introduces a universal Salami Attack framework that breaks down harmful intents into incremental, sub-threshold queries, achieving 90–98% attack success rates across multiple models.
The proposed Cumulative Query Auditing (CQA) defense significantly reduces attack success, highlighting the need for sequence-level risk assessment in LLM safety mechanisms.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Introduction and Motivation

This work rigorously formalizes and empirically demonstrates the Salami Slicing Risk in contemporary LLM deployments. It exposes a pivotal alignment vulnerability: despite significant advances in alignment and single-turn refusal mechanisms, aligned LLMs remain susceptible to multi-turn, low-intensity adversarial probing that incrementally compounds ostensibly benign user intent until severe harmful content is elicited. Crucially, this is achieved without ever crossing single-turn safety thresholds, thereby bypassing current alignment infrastructure.

The paper articulates a general attack framework—the Salami Attack—and validates its efficacy against leading foundation models, vision-LLMs (VLMs), and diffusion models. It further introduces Cumulative Query Auditing (CQA), a high-utility, practically implementable defense that tightly couples theoretical guarantees with empirical effectiveness.

Figure 1: Real-world demonstration of the Salami Slicing threat on ChatGPT's production interface, where compound harmless queries culminate in a high-risk response without tripping any single-turn filter.

Theoretical Foundations and Vulnerability Analysis

Prior multi-turn jailbreaks—e.g., Crescendo, FITD, GOAT—leverage context drift and staged semantic shifts, but almost universally rely on an explicit, high-risk final query. This invariably triggers modern single-turn refusal filters, undermining practical effectiveness and greatly restricting transferability. In contrast, this work identifies that LLM safety infrastructures predominantly operate as single-turn thresholding systems, with attention mechanisms and internal safety circuits highly biased to the most recent prompt. This is substantiated by probing model internals (especially attention head L2H26 in Llama-2-7B-Chat), revealing marked recency focus in safety-critical computation.

Figure 2: Attention head analysis in Llama-2 reveals contextual focus on the latest user turn during a successful 4-turn Salami Attack, corroborating the hypothesis that safety alignment is predominantly single-turn.

Further, mechanistic interpretability on early LLM layers shows that multi-turn attacks incrementally progress through activation space without entering the distinct region associated with explicit refusal—a clear divergence from single-turn attacks.

Figure 3: 2D PCA of activations at hidden layers indicate multi-turn attacks gradually depart from non-refusal regions, yet avoid activation clusters corresponding to explicit model refusals.

The authors then formalize the Salami Slicing Risk with a rigorous abstraction: for every aligned LLM with a refusal threshold $\tau_\text{thres}$ and ideal harmfulness scorer, there exists a prompt sequence $[p_1, \dots, p_n]$ , each individually sub-threshold ( $\tilde{H}(p, \mathbf{p}) < \tau_\text{thres}$ ), such that the cumulative conversation context surpasses any desired harm level $h$ , all without triggering a single explicit refusal. This is proven with careful construction of prompt sequences and grounded in the empirical dynamics of current LLM safety systems.

The Salami Attack: Methodology and Performance

The Salami Attack framework instantiates the theoretical risk in both manual and automated modes. The automated pipeline decomposes a harmful intent $p$ into a safe template plus $k$ progressive perturbations, each repeated $t$ times, ensuring each step avoids abrupt semantic deltas and typical alignment sentinels. The attack is universal in that it does not require model-specific context engineering and is fully black-box deployable.

Figure 4: Flowchart for automated Salami Attack—LLMs generate decomposed sequences of benign-appearing queries that, when compounded, systematically escalate harm.

The adaptive A-Salami extension utilizes a multi-agent approach with reflection, evaluation, and memory modules to dynamically revise strategies in the face of context drift, mid-turn refusals, or unpredictable model responses. This enhances the attack's robustness on stateful, highly aligned or contextually volatile LLM targets.

Numerical Results and Model Implications

Empirical evaluation is performed on leading SOTA LLMs (GPT-4o, Gemini 2.5 Pro, Deepseek V3, Qwen3-235B, GLM-4.5) and benchmarks (AdvBench, HarmBench, JailbreakBench). Salami Attack achieves 90–98% ASR across all models and benchmarks, consistently outperforming prior SOTA multi-turn and agentic attacks (Crescendo, GOAT, CFA, CIA, PAIR). Harmful scores on a 1–10 scale are elevated (typically 8.5–9.8 per attack), indicating not merely filter bypass but substantive, actionable harmful output. The adaptive variant (A-Salami) pushes success rates near the theoretical max.

Efficiency gains are substantial—the Salami Attack reduces attacker token cost by over 80% and execution time by 50% relative to other multi-turn attacks. Moreover, its performance preserves under cross-model transfer, with high ASR even when attack prompt sequences crafted on one LLM are launched against orthogonal deploys.

Additionally, salami-based pipelines generalize to multi-modal systems:

VLMs: On MM-SafetyBench, ASR is consistently above 55% on both GPT-4o and Qwen-VL-Max, exceeding visual-jailbreak baselines.
Diffusion Models: On OVERT, Salami Attack achieves an 80% filter bypass rate with 81% prompt-visual semantic consistency.

Figure 5: Salami Attack generates harmful images from originally benign queries via iterative textual perturbations in diffusion-based systems.

Salami Attack maintains high harmfulness across all HarmBench categories, substantially outperforming Crescendo and GOAT in "softer" categories demanding high-fidelity creative generation, confirming its broad coverage and lack of reliance on blunt, policy-violation triggers.

Figure 6: Category-wise breakdown of harmful scores shows Salami Attack dominance, especially in classes requiring nuanced, content-rich generation.

Robustness to Defenses and Mitigation Strategies

Evaluation against current defenses—including Self-Reminder prompting, Input Filtering (external LLM-based), and Smooth LLM (consistency/randomization-based detection)—finds that Salami Attack maintains an average ASR of 83.5%, a marginal reduction compared to unprotected systems. Existing defenses, heavily tuned for final-turn refusals or keyword-spotting, are systematically evaded since the attack distributes harm below per-turn detection thresholds. Notably, even advanced LLM content filtering systems only minimally constrain the attack.

To address this, the authors propose Cumulative Query Auditing (CQA), a defense that audits the entire historical prompt sequence for cumulative harm via the model's own alignment infrastructure, rather than evaluating solely the most recent utterance. This generalizes refusal triggers to multi-turn context, effectively neutralizing Salami Slicing.

When deployed, CQA reduces Salami Attack ASR by 44.8% (from 87.0% to 48.0%) and outperforms all prior reactive defense systems. Importantly, over-refusal (false positive) rates remain negligible, as validated on benign prompt collections.

Broader Implications and Future Directions

The paper's characterization of Salami Slicing as an alignment vulnerability decisively reframes the engineering of LLM safety infrastructures and raises several implications:

Real-world LLM systems must reason about long-term, aggregate conversational risk, not just single-turn toxicity. Any aligned model with sequential context aggregation is fundamentally susceptible to salami-style decomposition attacks if it only applies per-turn auditing.
Black-box, model-agnostic multi-turn attack surfaces are starkly larger than currently acknowledged. The Salami Attack framework provides a tractable, scalable tool for red-teaming, revealing latent failure modes invisible to traditional prompt engineering-based approaches.
CQA and other cumulative auditing frameworks must become standard in LLM deployment pipelines. Defense robustness and evaluation must target sequence-level, not utterance-level, harm.

The findings also open directions for designing models with formal accumulative risk awareness (e.g., context-wide attention regularization, explicit multi-turn alignment heads) and richer, offense-informed pretraining or RLHF methodologies.

Conclusion

This study articulates, substantiates, and mitigates a central emergent risk in multi-turn LLM deployments. By formalizing the Salami Slicing Risk, developing a universal attack instantiation, and delivering the first mathematically principled and empirically effective cumulative-defense strategy (CQA), it advances both the theory and practice of LLM alignment. The results emphasize a pivotal security paradigm shift: LLM guardrails must transition from single-turn defensibility to holistic, cumulative risk management to ensure robust, real-world AI safety.

Reference: "The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems" (2604.11309)

Markdown Report Issue