Papers
Topics
Authors
Recent
Search
2000 character limit reached

MalOptBench: Malicious Optimization Benchmark

Updated 5 January 2026
  • MalOptBench is a systematic benchmark designed to assess LLM vulnerability to generating harmful optimization algorithms.
  • It employs a two-stage adversarial prompt generation pipeline and evaluates 13 LLMs using both direct prompts and the MOBjailbreak method.
  • Experimental results reveal nearly 100% attack success under MOBjailbreak, highlighting the urgent need for advanced alignment strategies.

MalOptBench is a systematic benchmark specifically developed for evaluating the vulnerability of LLMs to malicious intelligent optimization algorithm requests. Unlike general toxic prompt or code-generation safety tests, MalOptBench targets the scenario where LLMs are tasked with generating or facilitating optimization algorithms that serve harmful purposes—for example, automating fraud, stealth account manipulation, resource misuse, or schedule subversion. This benchmark is motivated by the increasing use of LLMs as automated "algorithm designers" in complex decision-making, where subtle prompt manipulation can cause models to encode sophisticated malicious behaviors disguised as legitimate optimization problems (Gu et al., 1 Jan 2026).

1. Formal Definition and Threat Model

MalOptBench is formally defined as the triple

MalOptBench=R, T, F,\text{MalOptBench} = \langle R,~\mathcal{T},~\mathcal{F}\rangle,

where R={r1,,r60}R = \{r_1,\dots,r_{60}\} denotes the set of 60 malicious optimization-algorithm requests. The threat model T\mathcal{T} encompasses both direct malicious queries PP and their jailbreak variants J(P)J(P), where Rdirect=ftarget(P)R_{\text{direct}} = f_{\text{target}}(P) captures model output under direct attack, and Rjail=ftarget(J(P))R_{\text{jail}} = f_{\text{target}}(J(P)) evaluates the effect of targeted jailbreak prompts. The set F\mathcal{F} constitutes the collection of 13 target LLMs under evaluation.

This structure enables rigorous, reproducible testing of LLMs’ responses to both explicit and obfuscated malicious algorithm design requests, probing both standard guardrails and their circumvention via tailored jailbreak strategies.

2. Benchmark Composition and Taxonomy

MalOptBench covers four canonical optimization problems, recast in malicious contexts:

  • Online Bin Packing (OnlineBP): Stealth allocation to minimize bins while evading detection.
  • Traveling Salesman Problem (TSP): Route planning designed to create privacy or security leaks.
  • Flow Shop Scheduling Problem (FSSP): Multi-machine sequencing manipulated for fraudulent throughput.
  • Bayesian Optimization Acquisition Function Design (BOAFD): Crafting cost-aware acquisition strategies that skirt legitimate evaluation metrics.

Within each category, five adversarial "mission backgrounds"—e.g., black-hat marketer, insider hacker—are instantiated. Each background is subjected to a three-fold prompt rewriting, yielding a total of 5×3=155\times 3 = 15 variants per task and 60 malicious optimization requests overall. The taxonomy encompasses objective manipulation (maximizing stealth or minimizing detection), constraint subversion (evading legal or resource constraints), stealth account-rotation, and cost-evasion.

3. Benchmark Construction Pipeline

MalOptBench leverages a two-stage, LLM-driven benchmark generation pipeline:

  1. Adversarial User Generator: Given a base optimization task, GPT-4o enumerates five malicious user profiles in first-person, each providing a distinct adversarial context.
  2. Task Rewriter: For each profile, GPT-4o rewrites the task description into three distinct, realistically styled malicious prompts.

Following generation, all 60 prompts are subjected to manual expert review and automated GPT-4o-based policy checks, ensuring each request is both unambiguously malicious (in violation of standard usage policies) and constitutes a well-posed optimization problem.

4. Evaluation Metrics

Assessment of LLM vulnerability on MalOptBench employs two principal metrics:

ASR=1RrR1(ftarget(r) yields a harmful algorithm solution)\mathrm{ASR} = \frac{1}{|R|} \sum_{r \in R} \mathbf{1}(f_{\text{target}}(r) \text{ yields a harmful algorithm solution})

quantifies the proportion of prompts yielding overtly harmful algorithmic responses.

  • Harmfulness Score (HS):

Each LLM output is rated on a scale from 1 (harmless refusal) to 5 (complete malicious algorithm fulfillment), as assessed by GPT-4o under detailed guidelines. The average HS is calculated as

HSavg=1RrRs(r)\mathrm{HS}_{\mathrm{avg}} = \frac{1}{|R|} \sum_{r \in R}s(r)

where s(r)s(r) denotes the score for each response.

5. The MOBjailbreak Method

The MOBjailbreak method is specifically tailored to circumvent LLM safeguards in the context of malicious optimization requests. The procedure is as follows:

  1. The original malicious prompt PP is embedded in a benign template that frames it as "fictional narrative advice" for a creative author.
  2. An open-source surrogate model (DeepSeek-V3) rewrites the wrapped prompt, producing J(P)J(P)—an obfuscated version that omits overtly illegal terms yet preserves the underlying optimization objective.
  3. The rewritten variant J(P)J(P) is submitted to the target LLM.

Pseudocode for the process:

1
2
3
4
for each prompt P in R:
   wrap  format(MOB_template, P)
   J  surrogate_model.rewrite(wrap)
   output J as jailbreak variant
The two-stage nature of MOBjailbreak—embedding and surrogate-model rewriting—ensures that neither the wrapper nor the intermediate surrogate triggers standard LLM safety protocols, while still generating an effective malicious request.

6. Experimental Results

A total of 13 LLMs were evaluated, encompassing closed-source models (GPT-4o, GPT-5, OpenAI-o3, Gemini-2.5-Flash, Claude-Sonnet-4, Doubao-Seed-1.6, Grok-3-mini, ERNIE-4.5-Turbo, Command-A) and open-source ones (DeepSeek-V3, DeepSeek-V3.1, Qwen3-235B, Microsoft-Phi-4).

Setting Average ASR Average HS
Original Prompts 83.59% 4.28
MOBjailbreak 97.95% 4.87

GPT-5 (ASR 38.3%, HS 2.58) and OpenAI-o3 (ASR 55.0%, HS 3.13) showed only partial resistance to direct malicious prompts, with all other models exceeding ASR 80% and HS 4.0. Under MOBjailbreak, nearly all models failed to mitigate attacks, with both ASR and HS nearing completeness.

The study further evaluated two plug-and-play defense methods—SAGE and Self-Reminder. While these defenses substantially reduced success on original prompts (ASR dropping to 12–30%), they proved ineffective against MOBjailbreak, with ASR remaining above 80% and HS ~4.5. SAGE is also noted to cause exaggerated refusals on otherwise benign queries, indicating calibration issues (Gu et al., 1 Jan 2026).

7. Implications, Limitations, and Future Directions

Key findings indicate that most LLMs lack reliable mechanisms to refuse malicious algorithm-design requests in zero-shot settings. Even the latest models demonstrate only partial resistance, while MOBjailbreak easily bypasses typical safeguards, driving attack success rates near 100%. Plug-and-play defenses, while somewhat effective on direct prompts, fail against this targeted jailbreak and introduce side-effects on benign input.

Attention analysis performed in the study reveals that LLMs disproportionately focus on algorithmic instruction cues (such as “minimize” or “capacity”) rather than on explicit indicators of malicious intent (“hacker,” “fraud”), which explains their susceptibility to optimization-focused jailbreaks. This suggests a fundamental limitation of current alignment and safety paradigms that rely heavily on keyword filtering or toxicity detection.

The public release of MalOptBench and MOBjailbreak is intended to stimulate research on robust, domain-aware alignment strategies. There is an identified urgent need for alignment techniques that move beyond standard prompt-filtering or toxicity pipelines, addressing the nuanced semantics and operational goals of optimization algorithms in adversarial contexts (Gu et al., 1 Jan 2026).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to MalOptBench.