MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning (2506.16792v3)

Published 20 Jun 2025 in cs.CL and cs.AI

Abstract: Despite efforts to align LLMs with societal and moral values, these models remain susceptible to jailbreak attacks -- methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box LLMs via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version -- order-determining optimization. We conduct extensive experiments on two datasets using two open-source and four closed-source models. Results show that MIST achieves competitive attack success rate, relatively low query count, and fair transferability, outperforming or matching state-of-the-art jailbreak methods. Additionally, we conduct analysis on computational efficiency to validate the practical viability of MIST.

Summary

The paper presents a novel iterative semantic tuning framework to jailbreak black-box large language models while preserving prompt semantic integrity.
It introduces two strategies, MIST-SSS and MIST-ODO, that leverage part-of-speech filtering and synonym substitution to optimize prompt refinement.
MIST achieves up to 91% attack success rates, outperforming baselines in efficiency and robustness against common defense mechanisms.

"MIST: Jailbreaking Black-box LLMs via Iterative Semantic Tuning" (2506.16792)

Introduction

The paper "MIST: Jailbreaking Black-box LLMs via Iterative Semantic Tuning" provides a novel approach to exploiting vulnerabilities in LLMs through a method the authors name MIST. This technique focuses on "jailbreaking" or generating harmful content from black-box LLMs by iteratively refining prompts while maintaining their semantic integrity. MIST aims to address current limitations in black-box attack scenarios, emphasizing computational efficiency and semantic coherence compared to existing methodologies.

Methodology

MIST employs an innovative framework leveraging iterative semantic tuning to refine prompts until they elicit a specified response. The method consists of three main stages: part-of-speech filtering, synonym set construction, and dual-objective iterative semantic tuning.

Figure 1: An illustration of MIST framework.

Iterative Semantic Tuning

Part-of-Speech Filtering and Synonym Sets: The process begins by filtering tokens based on their part-of-speech to focus on meaningful words. Synonym sets are constructed for these terms using WordNet to explore potential substitutions that preserve semantic content.
MIST-SSS and MIST-ODO Strategies:
- Sequential Synonym Search (MIST-SSS): Tokens in the prompt are sequentially substituted to find a compliant response, optimizing semantic similarity using all-mpnet-base-v2 embeddings and cosine similarity.
- Order-Determining Optimization (MIST-ODO): This advanced strategy involves three stages: random token substitution to initially achieve a non-refusal response, original token recovery to optimize semantic similarity while maintaining compliance, and computing optimization order based on semantic loss to guide efficient prompt refinement.
  Figure 2: An illustration of the three stages in MIST-ODO.

Experimental Results

MIST's effectiveness was validated across multiple LLMs and datasets, JailbreakBench and AdvBench, demonstrating high attack success rates and transferability.

Attack Success Rates: MIST-ODO outperformed existing baselines in achieving high success rates while maintaining prompt semantic integrity across both datasets. Particularly, it achieved an ASR-G (Attack Success Rate according to GPT evaluation) of up to 91% on certain models.
Efficiency:

MIST-ODO required significantly fewer query calls (e.g., an average of 23.2 on GPT-4-turbo) compared to other methods, underscoring its efficiency in a strict query budget scenario.

Figure 3: The comparison on computational efficiency of MIST with different parameters. (a) Relationship between token substitution count / query count and index $k^*$ of first tuned prompt satisfying jailbreak condition with different synonym set sizes. (b) Relationship between probability $P[Y=t_g]$ that Y equals given $t_g$ and random token substitution count $t_g$ under different alpha values.

Figure 4: Radar chart reflecting the ASR-G of our method MIST and baselines across six models on JailbreakBench.

Performance Against Defenses

MIST-ODO also maintained robustness when subjected to various defense mechanisms such as PPL-Filter, Backtranslation, and RID. These defenses could detect and thwart some jailbreak methods, but MIST-ODO demonstrated enhanced resistance and maintained efficacy.

Figure 5: Bar charts reflecting the ASR-G of MIST-ODO and baselines when applying three different defenses.

Conclusion

MIST provides a robust framework for exploring and exploiting the security boundaries of LLMs without requiring access to their internal components. By iterating on prompt semantics, MIST facilitates efficient and successful black-box jailbreaks, highlighting the ongoing challenges in aligning LLMs with safety standards. This work encourages further research into harnessing MIST-generated prompts as datasets for improving the resilience of LLMs against adversarial prompts, ultimately contributing to safer and more reliable AI systems.