Multi-lingual Multi-turn Automated Red Teaming

Updated 9 June 2026
  • MM-ART is a framework that automates multi-turn, multi-lingual adversarial interactions to uncover nuanced safety vulnerabilities in large language models.
  • It employs phased attack construction, persona conditioning, and iterative probing to simulate sophisticated, adaptive adversarial strategies.
  • Empirical evaluations show substantial increases in attack success rates across languages, emphasizing the need for robust, translation-aware safety measures.

Multi-lingual Multi-turn Automated Red Teaming (MM-ART) is a class of frameworks and experimental pipelines designed to measure and probe the safety boundaries of LLMs and agentic LLM-based systems, specifically by automating adversarial interactions that span multiple conversational turns in diverse languages. MM-ART systematically simulates sophisticated attacks that go significantly beyond single-turn probing, targeting the nuanced vulnerabilities that emerge during extended, adaptive dialogue. Implementations of MM-ART incorporate adversarial strategy decomposition, persona conditioning, iterative probing, and multilingual evaluation—yielding comprehensive assessments of system robustness across both depth (turn count) and linguistic breadth (Talokar et al., 18 Feb 2026, Singhania et al., 4 Apr 2025, Yuan et al., 6 Jan 2026).

1. Conceptual Foundations and Threat Model

MM-ART is grounded in the need to automate the discovery of safety failures that arise during multi-turn engagements—in particular, scenarios where a malicious actor leverages the agentic capabilities of LLMs (including tool use, memory, and role orientation) to progressively orchestrate complex, illicit workflows. The principal objective is to quantify how, and under what conditions, LLM agents can be induced to assist in harmful activity not in a single prompt but through a coordinated sequence of conversational steps. The framework is inherently black-box: attackers have no knowledge of a model’s internal parameters or defenses, operating purely on observable interactions and accessible tool APIs.

The core adversarial workflow is decomposed as follows:

  • Strategist: Decomposes an illicit or harmful intent $H$ into a sequence of minimal atomic phases $P = \{p_0, \ldots, p_{n-1}\}$, each approachable as a standalone subgoal. A plausible benign persona $\psi$ is synthesized to mask the true intent.
  • Attacker: Enacts the persona and iteratively queries the target model, adaptively rephrasing prompts and advancing phases using feedback from previous turns.
  • Judge Agents: Automated evaluators classify each target response for (i) explicit or implicit refusals and (ii) successful completion of the current subgoal.

This staged, adaptive attack loop is executed under resource constraints, specified by a maximum number $S_{\max}$ of phased strategies and a maximum per-strategy turn cap $T_{\max}$ (Talokar et al., 18 Feb 2026, Singhania et al., 4 Apr 2025).

2. MM-ART Framework Architectures and Detailed Methodology

MM-ART is instantiated in various research frameworks; STING (“Sequential Testing of Illicit N-step Goal execution”) represents a leading implementation. All MM-ART pipelines operate through algorithmic modules that generate probing conversations, conduct iterative multi-turn adversarial exchanges, and assess safety boundary violations both in the original and translated form across supported languages.

Phased Attack Construction: The attack is constructed by first synthesizing a persona and cover story ($\psi$), followed by decomposing the overall malicious goal $H$ into $N$ tool-callable atomic subgoals $p_i$. This decomposition is critical for evading static safety defenses that frequently gate single-step attempts but may fail to detect composite, distributed risks.

Probing and Judge Loop: For each phase, the attacker prompts the target under the persona. Responses are judged as refusals or completions. If refusal is detected, the query is adapted to circumvent or disguise the malicious request; if incomplete, the next probe targets the unresolved aspects until completion or exhaustion of the turn budget. The overall process is formalized in explicit pseudocode as shown in the original source, supporting modular replacement of roles, templates, and languages (Talokar et al., 18 Feb 2026).
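The probe-and-judge loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pseudocode from the cited source: the names `Strategy`, `run_strategy`, and `red_team`, the feedback labels, and the four injected callables (attacker, target, refusal judge, completion judge) are all hypothetical, and the toy stubs at the bottom merely stand in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (prompt, reply)


@dataclass
class Strategy:
    persona: str       # cover story (psi in the text)
    phases: List[str]  # atomic subgoals, in order


@dataclass
class RunResult:
    jailbreak_index: Optional[int]  # first strategy that finished all phases
    transcripts: List[List[Turn]] = field(default_factory=list)


def run_strategy(strategy: Strategy,
                 attacker: Callable[[str, str, List[Turn], str], str],
                 target: Callable[[str, List[Turn]], str],
                 is_refusal: Callable[[str], bool],
                 is_complete: Callable[[str, str], bool],
                 t_max: int) -> Tuple[bool, List[Turn]]:
    """Probe one strategy until every phase completes or t_max turns are used."""
    history: List[Turn] = []
    phase, feedback = 0, "start"
    while phase < len(strategy.phases) and len(history) < t_max:
        goal = strategy.phases[phase]
        prompt = attacker(strategy.persona, goal, history, feedback)
        reply = target(prompt, history)
        history.append((prompt, reply))
        if is_refusal(reply):
            feedback = "refused"      # attacker should rephrase next turn
        elif is_complete(goal, reply):
            phase += 1                # advance to the next subgoal
            feedback = "completed"
        else:
            feedback = "incomplete"   # probe the unresolved part again
    return phase == len(strategy.phases), history


def red_team(strategies: List[Strategy], attacker, target,
             is_refusal, is_complete, s_max: int, t_max: int) -> RunResult:
    """Try at most s_max strategies; stop at the first full success."""
    result = RunResult(jailbreak_index=None)
    for idx, strategy in enumerate(strategies[:s_max]):
        ok, history = run_strategy(strategy, attacker, target,
                                   is_refusal, is_complete, t_max)
        result.transcripts.append(history)
        if ok:
            result.jailbreak_index = idx
            break
    return result


# Toy stand-ins (assumptions for illustration only, no real model involved).
def toy_attacker(persona, goal, history, feedback):
    return f"[retry] {goal}" if feedback == "refused" else goal


def toy_target(prompt, history):
    first_try_at_b = "phase-B" in prompt and not prompt.startswith("[retry]")
    return "REFUSE" if first_try_at_b else f"done: {prompt}"


def toy_is_refusal(reply):
    return reply == "REFUSE"


def toy_is_complete(goal, reply):
    return reply.startswith("done") and goal in reply
```

Passing the judges and the attacker as callables mirrors the modular replacement of roles mentioned in the text: any of them can be swapped without touching the loop itself.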

Multilingual Support: MM-ART can operate natively or via translation, allowing comprehensive cross-linguistic safety evaluation. Attack prompts are generated, translated, relayed to the target model, and responses are translated back for judge consumption; all history is maintained in both languages. This approach enables evaluation of policy drift and safety filter performance across resource tiers and script families (Singhania et al., 4 Apr 2025, Talokar et al., 18 Feb 2026).
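The translation relay described above might look like the following sketch. The names `BilingualTurn` and `relay_turn`, the `(text, src, dst)` translator signature, and the choice of English as the pivot language are assumptions made for illustration; the toy translator only rewrites a language tag and does not call any real translation service.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical translator signature: (text, source_lang, dest_lang) -> text
Translate = Callable[[str, str, str], str]


@dataclass
class BilingualTurn:
    prompt_pivot: str   # prompt as generated by the attacker
    prompt_target: str  # prompt as sent to the target model
    reply_target: str   # reply as returned by the target model
    reply_pivot: str    # reply as seen by the judge


def relay_turn(prompt_pivot: str,
               target_lang: str,
               translate: Translate,
               target: Callable[[str], str],
               history: List[BilingualTurn],
               pivot_lang: str = "en") -> BilingualTurn:
    """Translate the prompt, query the target, translate the reply back,
    and record the turn in both languages."""
    prompt_target = translate(prompt_pivot, pivot_lang, target_lang)
    reply_target = target(prompt_target)
    reply_pivot = translate(reply_target, target_lang, pivot_lang)
    turn = BilingualTurn(prompt_pivot, prompt_target, reply_target, reply_pivot)
    history.append(turn)
    return turn


# Toy stand-ins: the "translator" only swaps a leading language tag.
def toy_translate(text: str, src: str, dst: str) -> str:
    return text.replace(f"[{src}]", f"[{dst}]", 1)


def toy_target(prompt: str) -> str:
    return prompt + " -> ok"
```

Keeping both language versions of each turn in one record lets the judge read the pivot-language text while later analysis can still inspect exactly what the target model received and produced.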

3. Evaluation Metrics, Mathematical Formalism, and Taxonomies

Quantitative assessment in MM-ART is governed by survival analysis and rate-based metrics:

Core Definitions

[Equation not recovered in this copy: a rate-based metric defined over a per-response safety indicator.]

The indicator equals 1 if the response is unsafe and 0 otherwise (Singhania et al., 4 Apr 2025).

  • Relative Turn-depth Increase:

[Two equations not recovered in this copy.]

  • Time-to-First-Jailbreak Random Variable: The index of the earliest successful attack plan among those attempted. Survival and discovery functions, hazard functions, and restricted mean jailbreak discovery (RMJD) formalize discovery timing and hazard attribution by covariates such as language (Talokar et al., 18 Feb 2026).
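As a rough sketch of how such quantities are commonly computed, the code below assumes that ASR is the mean of per-response unsafe indicators and that the time to first jailbreak is a 1-based plan index, recorded as `None` when no plan succeeds within the budget. These definitions, the function names, and the symbol `s_max` are assumptions for illustration and may differ from the cited papers; RMJD is deliberately not implemented because its exact definition is not given here.

```python
from typing import List, Optional, Sequence


def attack_success_rate(unsafe_flags: Sequence[int]) -> float:
    """Mean of 0/1 indicators (1 = response judged unsafe)."""
    return sum(unsafe_flags) / len(unsafe_flags)


def discovery_curve(first_jailbreak: Sequence[Optional[int]],
                    s_max: int) -> List[float]:
    """F(s): share of runs whose first successful plan index is <= s,
    for s = 1..s_max. None means no jailbreak within the budget."""
    n = len(first_jailbreak)
    return [
        sum(1 for t in first_jailbreak if t is not None and t <= s) / n
        for s in range(1, s_max + 1)
    ]


def survival_curve(first_jailbreak: Sequence[Optional[int]],
                   s_max: int) -> List[float]:
    """S(s) = 1 - F(s): share of runs still not jailbroken after s plans."""
    return [1.0 - f for f in discovery_curve(first_jailbreak, s_max)]


def hazard_curve(first_jailbreak: Sequence[Optional[int]],
                 s_max: int) -> List[float]:
    """h(s) = P(T = s | T >= s), estimated from counts.
    Runs with None stay in the at-risk set for every s."""
    out = []
    for s in range(1, s_max + 1):
        at_risk = sum(1 for t in first_jailbreak if t is None or t >= s)
        events = sum(1 for t in first_jailbreak if t == s)
        out.append(events / at_risk if at_risk else 0.0)
    return out
```

Because every run is followed for the same budget of `s_max` plans, plain counting is enough here; unequal follow-up would call for a censoring-aware estimator such as Kaplan-Meier.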

Strategy and Error Taxonomies

  • Attacker Strategies (e.g., AF1–AF10): Authority pressure, urgency creation, threat of loss, information harvesting, channel shift, credential engineering, reciprocity trap, sunk-cost trap, rapport building, payment engineering.
  • Defensive Responses (DF1–DF10): Authority verification, deliberate delay, threat de-escalation, data minimization, channel control, credential skepticism, reciprocity resistance, exit readiness, emotional boundary, payment friction.
  • Error Modes (EF1–EF7): Guardrail misalignment, meta-prompting drift, template artifacts, interactive vulnerability, coherence erosion, linguistic artifacts, policy divergence (Yuan et al., 6 Jan 2026).

Topic Modeling and Annotation

Attacks and responses are analyzed via clustering embeddings (e.g., multilingual-e5-base) and topic modeling, with interpretative coding of dialogue segments into strategy and defense families to support qualitative and quantitative attribution (Yuan et al., 6 Jan 2026).

4. Empirical Findings in Multi-lingual and Multi-turn Red Teaming

Extensive experiments across real-world LLM products and open-source models demonstrate that longitudinal, adaptive multi-turn red-teaming exposes substantially more vulnerabilities than previously recognized:

  • Multi-turn Depth Effects: In trials on Bedrock models, average Attack Success Rate (ASR) after five turns in English is 36%, a 71% relative increase over the first turn (which implies a turn-1 ASR of about 21%). For Japanese, ASR increases from 34% (turn 1) to 62% (turn 5), a 195% relative increase over the initial English baseline (Singhania et al., 4 Apr 2025).
  • Language-wise Vulnerability: Non-Latin languages attain ASR values as high as 68.5% (averaged across models); Japanese reached 71.1%, Hindi 65.8%, and Arabic 59% (Singhania et al., 4 Apr 2025). Notably, cross-lingual hazard varies by model and script: e.g., Telugu had a lower hazard ratio than English in Qwen3-Next, contrary to expectations from chatbot-only research (Talokar et al., 18 Feb 2026).
  • Starter-set and Template Effects: Automated prompt generation via in-context learning (ICL) with models like Mistral-7B or Mixtral-8×7B achieves competitive or superior penetration compared to human-crafted seeds; overfitted single-turn templates (e.g., Multi-Jail) tend to induce higher refusal but lower multi-turn success (Singhania et al., 4 Apr 2025).
  • Error Analysis: Guardrail misalignment, persona drift, and template artifacts are the dominant failure modes, with error profiles differing across languages (e.g., Chinese models triggered premature guardrail failures far more frequently) (Yuan et al., 6 Jan 2026).
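The turn-depth percentages quoted in the first bullet can be cross-checked with a few lines of arithmetic. The sketch assumes the relative increase is computed as (value - baseline) / baseline; that formula and the helper names are assumptions, and the inputs are the rounded figures from the text, so the results are approximate.

```python
def relative_increase(value: float, baseline: float) -> float:
    """(value - baseline) / baseline, as a fraction (1.0 = +100%)."""
    return (value - baseline) / baseline


def baseline_from_increase(value: float, rel_increase: float) -> float:
    """Invert relative_increase: recover the baseline from the final value."""
    return value / (1.0 + rel_increase)


# English: 36% at turn 5, reported as a 71% increase over turn 1.
en_turn1 = baseline_from_increase(0.36, 0.71)   # about 0.21

# Japanese turn 5 (62%) measured against the English turn-1 baseline.
ja_vs_en = relative_increase(0.62, en_turn1)    # about 1.945, close to the reported 195%

# Japanese turn 5 measured against its own turn 1 (34%).
ja_vs_ja = relative_increase(0.62, 0.34)        # about 0.82
```

Under this reading the two English figures and the Japanese cross-language figure are mutually consistent, while Japanese measured against its own first turn rises by roughly 82%.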

5. Comparative Frameworks and Simulation Pipelines

Variants of MM-ART have been developed to suit both tool-augmented agents and pure LLM chatbots:

  • STING (Sequential Testing of Illicit N-step Goal execution): Designed for agentic LLMs with tool integration. It leverages phased decomposition, adaptive persona-driven attacks, and LLM judge architectures for refusal and phase completion detection. Empirically, STING yields higher AgentHarm Scores and earlier jailbreaks (as measured by RMJD) than single-turn or chat-oriented baselines—Qwen3-Next improved AHS from 35.1 to 72.7, and RMJD from 3.2 (single-turn) to 6.8 (multi-turn) (Talokar et al., 18 Feb 2026).
  • LLM-to-LLM Scam Simulators: Employ role conditioning for attacker (ScamBot) and defender (VictimBot), with explicit fraud scenario templates and human adjudication. This methodology enables granular analysis of attack taxonomy, escalation dynamics, and defensive counter-strategies (Yuan et al., 6 Jan 2026).
  • Automated Conversation Starter + Probe Pipelines: Systems automatically generate large pools of adversarial starter prompts, run LLM-guided probing loops over five or more dialogue turns (with dynamic translation for multilingual runs), and classify responses post-hoc using state-of-the-art safety judges (Singhania et al., 4 Apr 2025).

6. Implementation Guidelines and Limitations

Implementation Considerations

  • Tool API Exposure: Ensure all agent tools are accessible via a unified interface (e.g., JSON-RPC) and log artifacts for downstream verification.
  • History and Metadata Management: Persist turn-indexed conversation logs, persona parameters, phased plans, and language tags for accurate role and judge functioning.
  • Parallelization and Scalability: Deploy attacker/judge roles in parallel across strategies and seeds to maximize red-team coverage and efficiency.
  • Defense Stress-testing: Instrument the target agent context with guardrails and refusal filters, and measure standard metrics (AHS/ASR, RMJD) under control and defense conditions (Talokar et al., 18 Feb 2026).
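The logging and parallelization points above could be realized along the following lines. This is a sketch under assumptions: the `TurnRecord` fields, the JSON Lines format, and the use of a thread pool are illustrative choices rather than anything prescribed by the cited work, and `toy_job` is a placeholder for a real attacker/judge run.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List


@dataclass
class TurnRecord:
    strategy_id: int
    seed: int
    turn: int        # turn index within the conversation
    phase: int       # index of the phase being probed
    language: str    # language tag of the target-side text
    persona: str
    prompt: str
    reply: str
    refused: bool    # refusal judge verdict for this turn


def to_jsonl(records: Iterable[TurnRecord]) -> str:
    """One JSON object per line; ensure_ascii=False keeps non-Latin text readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def run_parallel(job: Callable[[int, int], List[TurnRecord]],
                 strategy_ids: Iterable[int],
                 seeds: Iterable[int],
                 max_workers: int = 4) -> List[TurnRecord]:
    """Run job(strategy_id, seed) for every pair and flatten the logs.
    Executor.map returns results in input order, so logs stay deterministic."""
    seed_list = list(seeds)
    pairs = [(s, k) for s in strategy_ids for k in seed_list]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(lambda p: job(*p), pairs))
    return [rec for batch in batches for rec in batch]


# Toy job (assumption): emits a single placeholder record per run.
def toy_job(strategy_id: int, seed: int) -> List[TurnRecord]:
    return [TurnRecord(strategy_id, seed, 0, 0, "ja", "persona",
                       "prompt", "reply", False)]
```

Threads suit this workload when each run mostly waits on remote model calls; a process pool or an async client would be alternatives for other setups.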

Limitations and Future Directions

Current MM-ART frameworks depend on the selection and accuracy of LLM-based judges, are limited by the depth of probing (five turns in the depth results of Section 4), and generally focus on text-only interactions. Automatic translation introduces minor but non-negligible variance in attack success rates. Starter generation also remains bottlenecked by the diversity of seed exemplars.

Proposed research extensions include integrating multi-modal stimuli, extending attack chains to longer turn depths using large-context LLMs, automating and diversifying starter generation, and formalizing topic coverage metrics. Additional defensive enhancements are targeted at real-time, translation-aware safety monitoring and dynamic refusal regeneration (Singhania et al., 4 Apr 2025, Talokar et al., 18 Feb 2026).

7. Significance and Research Trajectory

MM-ART has demonstrated that the compounded risks of adversarial multi-turn, multi-lingual engagements are substantially greater than those captured by single-prompt or English-centric assessments, revealing safety and alignment failures that traditional red-teaming omits. The methodology illuminates cross-linguistic patterns, system-specific vulnerabilities (e.g., linked to internal tool routing and memory), and cross-model error landscapes. MM-ART thereby augments the empirical and methodological foundation for robust LLM deployment, advancing industry and academic understanding of agentic, role-steered, and globally deployed LLM safety (Talokar et al., 18 Feb 2026, Singhania et al., 4 Apr 2025, Yuan et al., 6 Jan 2026).
