Papers
Topics
Authors
Recent
Search
2000 character limit reached

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Published 11 May 2026 in cs.LG and cs.AI | (2605.10067v2)

Abstract: Red teaming is critical for uncovering vulnerabilities in LLMs. While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

Summary

  • The paper introduces a self-evolving metacognitive loop that adapts red-teaming strategies to bypass LLM defenses.
  • It employs a three-phase approach—introspective diagnosis, adaptive policy formulation, and semantic gradient feedback—to optimize attacks in real time.
  • Empirical results demonstrate up to 89.2% attack success and significant token cost reductions compared to prior search-based methods.

Metis: Self-Evolving Metacognitive Policy Optimization for Jailbreaking LLMs

Introduction and Motivation

The security of deployed LLMs hinges on robust defenses against jailbreak attacks that elicit harmful, unethical, or forbidden content. Despite advances in reinforcement learning from human feedback (RLHF), supervised fine-tuning (SFT), and prompt engineering, state-of-the-art LLMs remain persistently vulnerable to sophisticated adversarial prompts and multi-turn attack strategies. Major limitations of prior automated red-teaming methods include brittle, static heuristic plans, stochastic search over shallow strategic spaces, and an inability to adapt to novel, unseen defense mechanisms in highly-aligned, frontier models.

The paper "Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization" (2605.10067) introduces a paradigm shift: reformulating jailbreaking as an inference-time policy optimization problem within an adversarial POMDP. The core innovation is a self-evolving metacognitive loop that combines introspective diagnosis, adaptive reasoning, and semantic-gradient-guided optimization, enabling substantially improved generalization, interpretability, and attack efficacy.

Framework: Adversarial POMDP and Metacognitive Dynamics

Metis formalizes the attack task as an Adversarial POMDP where the attacker faces an unknown latent defense D\mathcal{D} of the target model. The agent maintains a qualitative belief state over defenses, continuously updated as interaction unfolds. The policy agent employs a three-phase reasoning sequence at each turn: (1) Introspective Diagnosis updates the attacker’s belief about the latent defense used by the target; (2) Adaptive Policy Formulation generates high-level strategies targeted to the diagnosis; and (3) Executable Instantiation compiles the abstract strategy into a concrete prompt.

This pipeline is operationalized in a closed loop, with a separate Evaluator providing dense, structured feedback—a semantic gradient comprising reward, justification, and meta-suggestions—after each round. These meta-suggestions steer both belief updates and subsequent strategic adaptation, enabling rapid, directed policy optimization with explicit reasoning traces. Figure 1

Figure 1: The Metis inference-time optimization loop, where attacker and evaluator agents iterate to diagnose defense logic, reformulate attack strategies, and optimize prompts based on dense introspective feedback.

Innovations and Differentiators

Metis departs from prior red-teaming methods along several crucial axes:

  • Metacognitive Reasoning: Rather than shallow brute-force or stochastic search, Metis leverages introspective causal diagnostics—identifying not just superficial model refusal patterns but the abstract defense philosophy (e.g., intent scrutiny, lexical filtering, semantic coherence) in use, then dynamically tailoring counter-strategies.
  • Inference-Time Policy Optimization: Traditional search-based approaches treat jailbreaking as fixed-template pattern matching, stagnating on robust models. Metis instead frames each attack as an episodic policy optimization, with edge case adaptation and local policy refinement within the inference context window.
  • Structured Feedback as Semantic Gradient: The Evaluator, acting as an automated judge, returns a reward signal augmented by actionable reasoning and state-dependent meta-suggestions. This high-dimensional gradient is critical for efficient exploration of the strategy manifold, mitigating the sparseness and high variance of binary success/failure signals typical in prior RL-based red-teaming.
  • Explicit Reasoning Traces: Each dialogically-generated attack trajectory contains > and <strategy> blocks, making the rationale for choices transparent to downstream LLM safety engineers and adversarial training pipelines.

    Figure 2

    Figure 2

    Figure 2: Novelty (left) exhibits high semantic diversity across generated strategies; t-SNE visualization (right) reveals broad exploration of the attack policy manifold beyond fixed paradigmatic seeds.

    Empirical Results and Ablation

    Extensive experiments on 10 closed- and open-source LLMs, including GPT-4o, GPT-5-chat, Claude-3.7, O1, and Gemini 2.5, across the HarmBench and AdvBench benchmarks, demonstrate that Metis achieves the highest average Attack Success Rate (ASR: 89.2%), with sustained efficacy even on highly-aligned, resilient models (76% on O1, 78% on GPT-5-chat). Against strong baselines—Crescendo, ActorBreaker, X-Teaming, AutoDAN-Turbo, and GCG—Metis outperforms in both success rate and sample efficiency.

    Notable findings include:

    • Generalization, Not Overfitting: Baseline methods (e.g., X-Teaming) show satisfactory results on open-weight models but suffer substantial degradation on cutting-edge targets, with ASR drops as high as 51% on O1 and 49% on GPT-5-chat. Metis narrows this generalization gap, sustaining high performance across all tested models. Figure 3

    Figure 3

    Figure 3: Trajectory visualization: Metis (solid) exhibits highly directed, low-variance optimization paths compared to the redundant, meandering exploration of X-Teaming (dashed) on GPT-5-chat.

    • Efficiency: Through directed, closed-loop semantic-gradient optimization, Metis dramatically reduces token costs—by an average of 8.2×\times and up to 11.4×\times—compared to search-based baselines, while also requiring fewer interaction turns to reach attack success. Figure 4

      Figure 4: Scaling Laws—ASR (left) saturates rapidly with modest turn budgets for most models, while AQS (right) remains low, confirming highly efficient convergence.

    • Ablation and Component Sensitivity: Disabling either the Attacker’s metacognition or dense Evaluator feedback results in significant performance collapse; efficiency and attack rate degrade most rapidly when feedback is weakened, underscoring the importance of structured semantic gradients. Even when seed paradigms are omitted, Metis’s metacognitive loop sustains strong performance, highlighting the paradigm’s ability to synthesize novel attack vectors in situ.

    Theoretical and Practical Implications

    Metis highlights the vulnerability of current LLM defenses—including input perturbation, proxy-based detection, and SFT guards—to attacks that dynamically adapt policy at inference-time based on introspective diagnostics. Two core patterns of vulnerability are identified: (1) offline/online asymmetry, as real-time probing exposes gaps unaddressed by static alignment; (2) narrative-safety goal conflict, where dynamic contexts can induce models to deprioritize safety alignment for the sake of long-term coherence or task completion.

    Empirically, no existing defense mechanism fully thwarts Metis, even under adversarial evaluation on robust alignments. The attack’s metacognitive interpretability further enables its use in safety analytics, adversarial training, and automated vulnerability discovery in LLM pipelines.

    Interpretability and Future Directions

    A salient benefit of Metis is its explicit reasoning trace, which enhances auditing and model improvement cycles. Its metacognitive process aids not only in discovering minimal, highly efficient jailbreaks, but also in mapping the defense landscape of frontier models for advanced adversarial training.

    These results suggest a pressing need for next-generation LLM defenses: ones capable of dynamic, inference-time reasoning about safety in context, as opposed to static, hard-coded rules or classifier-based proxies. Future work may combine metacognitive attack agents with equally adaptive, introspective safety countermeasures (e.g., dynamically adaptive RLHF, context-aware classifiers) and closed-loop adversarial training incorporating Metis-derived trajectories.

    Conclusion

    Metis demonstrates state-of-the-art attack success and unmatched efficiency by recasting LLM red-teaming as an inference-time, metacognitive policy optimization problem. Its self-evolving, reasoning-driven adversarial loop exposes the brittleness of current static and stochastic defenses in frontier LLMs, and its explicit reasoning traces set a new benchmark for interpretability in automated jailbreak research. The findings underscore the necessity of safety paradigms that can dynamically reason about attack intent and adapt policies in real time, closing the gap identified by the persistent vulnerabilities revealed by Metis.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 7 tweets with 0 likes about this paper.