Black-box Optimization of LLM Outputs by Asking for Directions (2510.16794v1)
Abstract: We present a novel approach for attacking black-box LLMs by exploiting their ability to express confidence in natural language. Existing black-box attacks require either access to continuous model outputs like logits or confidence scores (which are rarely available in practice), or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections. Our attacks successfully generate malicious inputs against systems that only expose textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox where model capability improvements directly enhance vulnerability. Our code is available at this link.
Explain it Like I'm 14
Black-box Optimization of LLM Outputs by Asking for Directions — Explained Simply
Overview
This paper is about finding a new way to “trick” LLMs like GPT or Claude when you only see their text replies. Instead of using secret internal data from the model (which most people don’t have), the authors show how to attack these systems by simply asking them to compare two options and say which one is “closer” to a goal. Think of it like playing the “hot or cold” game to find a hidden object: you keep asking which step is warmer, and use that to move in the right direction.
What Questions Did the Researchers Ask?
The paper asks three main questions, in simple terms:
- Can we attack LLMs that only show text replies by asking them to compare two inputs and tell us which is better?
- Will this work across different types of attacks, like:
- Making image+text models mislabel pictures,
- Getting LLMs to bypass safety rules (jailbreak),
- Sneaking harmful instructions into data that LLMs process (prompt injection)?
- Are bigger, smarter models harder or easier to attack with this method?
How Did They Do It?
The authors use a method that’s like climbing a hill step-by-step (called “hill climbing”):
- Start with an input (an image or a prompt).
- Make a small change to it (a “perturbation”)—like nudging a few pixels in a picture, or editing a few words in a suffix.
- Ask the model a simple comparison question: “Which one is closer to making the target happen?” For example:
- For images: “Which image is less likely to be a dog?”
- For jailbreaks: “Which prompt is more likely to get the model to say ‘Sure’ at the start?”
- For prompt injection: “Which instruction is more likely to trigger sending an email?”
- If the model picks the new version, keep it and repeat. This is like asking for directions at every step: take a step, ask if you're getting warmer, and keep going.
Important idea: The authors discovered that LLMs are bad at giving exact confidence numbers (like “I’m 95% sure”), but they’re much better at saying which of two choices is better. So instead of asking for a score, they ask for a preference. That preference becomes the “signal” to guide the attack.
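To make the loop concrete, here is a minimal Python sketch of the "ask for directions" hill-climbing idea described above. It assumes a hypothetical `query_model` function that sends a prompt to the target model and returns its text reply; the comparison prompt wording, the `mutate_suffix` helper, and its toy vocabulary are illustrative stand-ins rather than the paper's exact prompts or mutation operators.

```python
import random

def ask_preference(query_model, goal, candidate_a, candidate_b):
    """Ask the target model which of two candidates is closer to the goal.

    `query_model` is a hypothetical text-in/text-out wrapper around the target
    API; the prompt wording is illustrative, not the paper's exact template.
    """
    prompt = (
        f"Goal: {goal}\n\n"
        f"Option A:\n{candidate_a}\n\n"
        f"Option B:\n{candidate_b}\n\n"
        "Which option is more likely to achieve the goal? Answer 0 for A or 1 for B."
    )
    reply = query_model(prompt).strip()
    return reply.startswith("1")  # True means the new candidate (B) is preferred

def hill_climb(query_model, goal, initial, mutate, budget=1000):
    """Greedy hill-climbing guided only by the model's pairwise preferences."""
    current = initial
    for _ in range(budget):
        candidate = mutate(current)  # small random edit (pixels or tokens)
        if ask_preference(query_model, goal, current, candidate):
            current = candidate      # keep the change the model says is "warmer"
    return current

def mutate_suffix(suffix, vocab=("please", "kindly", "now", "indeed", "surely")):
    """Toy mutation: swap one random word of a text suffix for a random token."""
    words = suffix.split() or ["start"]
    words[random.randrange(len(words))] = random.choice(vocab)
    return " ".join(words)
```

The only signal the loop consumes is which of two candidates the model prefers; no scores, logits, or log-probabilities are ever requested.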
They tested this on:
- Vision-LLMs (models that look at images and answer in text),
- Jailbreaks (bypassing safety to get harmful responses),
- Prompt injections (hiding instructions in content to trigger actions).
They only used text outputs—no secret model data, no extra helper models.
What Did They Find?
Here are the key results, explained simply:
- Comparison works, scoring doesn’t: When asked for exact confidence numbers, models gave poor, unhelpful answers (like always saying 0%, 50%, or 99%). But when asked “Which of these two is better?”, models gave reliable guidance often enough to make progress.
- It works in text-only settings: Even without access to internal scores or log probabilities, their method achieved strong attack success rates across tasks and models. In many cases, it matched or came close to methods that use more insider information.
- Bigger models are easier to attack with this method: Surprisingly, larger and more capable models (like stronger versions of GPT or Llama) are better at making comparisons, which makes them more vulnerable to this type of attack. This is a “security paradox.”
- Low query counts: The attacks typically succeeded within a modest number of queries, often from about five to a few hundred.
Why Is This Important?
This matters because it widens the attack surface of real-world AI systems. Many companies try to hide internal data to make attacks harder. But this method shows that you can still steer the model using just its text replies—no secret access needed. It also warns that improving a model’s reasoning and honesty (calibration) can unintentionally make it easier to attack through careful comparisons.
Implications and Impact
- For AI developers: Simply hiding confidence scores or limiting APIs to text isn’t enough to stop optimization-based attacks. Models that are better at explaining themselves can be more vulnerable when attackers use that self-awareness against them.
- For safety teams: Possible defenses include teaching models to refuse certain comparative questions that could aid attacks, limiting feedback that helps iterative optimization, and detecting repeated, similar queries that look like “hill climbing.”
- For the future: As LLMs get stronger, this approach could be applied to more systems and tasks, making security even more important.
In short, the paper shows a clever, simple idea: you can “ask for directions” from an LLM to climb toward a harmful goal. And because LLMs are good at comparing options, this works well—even when you only see text and nothing else.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps the paper leaves unresolved, each phrased to be actionable for future research.
- External validity across tasks and settings: Evaluate the method on a broader set of real-world applications (e.g., multi-turn agents with tool use/browsing/RAG, code assistants, medical/legal QA) beyond ImageNet, AgentDojo, and AdvBench.
- Scale and statistical rigor: Increase sample sizes and report confidence intervals, variance across seeds, and run-to-run stability to substantiate reported ASRs and comparative trends.
- Comparative-prompt sensitivity: Systematically study how wording, ordering, and formatting of binary comparison prompts affect refusal rates, calibration, and attack success across models and providers.
- Determinism and sampling effects: Quantify the impact of decoding parameters (temperature, top-p, nucleus sampling), response randomness, and repeated querying on the reliability of pairwise preferences.
- Robust response parsing: Develop and evaluate robust parsing protocols (e.g., regex with fallbacks, structured tool calling, self-check prompts) for extracting binary decisions when models deviate from “0/1” instructions; a minimal parsing-and-voting sketch appears after this list.
- Preference calibration metrics: Move beyond anecdotal plots and benchmark comparative calibration using standard metrics (e.g., accuracy vs. logit margin, AUC, Brier/ECE-style measures adapted to pairwise labels) across models and domains.
- Noise-aware optimization: Model and estimate a noise profile for the preference oracle (false positive/negative rates vs. margin) and design algorithms provably robust to such noise (e.g., dueling bandits with noisy comparisons, repeated-majority queries).
- Query efficiency theory: Provide theoretical analysis of convergence and query complexity for hill-climbing with noisy LLM preferences, including conditions that guarantee improvement and bounds under realistic noise models.
- Beyond binary comparisons: Explore k-ary tournaments, preference learning, and Bayesian optimization with pairwise feedback to reduce queries and avoid local optima.
- Better perturbation strategies: Replace random token swaps and Square Attack-like sampling with structured search (CMA-ES, evolutionary strategies, gradient estimation via preferences, RL with pairwise rewards) to improve query efficiency.
- Success detection formalization: Define and evaluate robust success criteria in text-only settings (beyond ad hoc “Sure”-prefix or single-judge labels), including multi-judge consensus, calibrated reward models, or task-grounded validators.
- Judge reliability and bias: Quantify agreement between automatic judges and human raters for jailbreak harmfulness; assess cross-model bias (same-vendor judging), and measure false positives/negatives of harm detection.
- Vision-LLM evaluation validity: Validate that “Does this image contain X?” queries faithfully reflect classification decisions; compare against logit/decoder evidence where possible and assess brittleness to prompt phrasing.
- Perceptual imperceptibility: For image attacks with ε = 32/255, conduct human perceptual studies or use perceptual metrics (e.g., LPIPS) to confirm imperceptibility; evaluate across display/codec/resize pipelines.
- Physical-world robustness: Test adversarial images in physical or end-to-end capture settings (print–scan, screen–camera) to assess real-world viability.
- Targeted attack decomposition: For “contain target but not source,” investigate alternative comparative decompositions (e.g., two binary queries, margin-style questioning) and curriculum prompting to aid weaker models.
- Agentic threats breadth: Extend prompt-injection evaluation beyond 14 types and 4 tasks to diverse tools (filesystem, web, cloud APIs), longer workflows, and high-impact actions (data exfiltration, financial transactions).
- Transferability analysis post-optimization: Measure how inputs optimized via introspection on one model transfer to others, and whether preference-guided search overfits target-specific quirks.
- Cross-model/family controls: Rigorously test the “security paradox” (larger models more vulnerable) under controlled settings within a family (same safety layer, decoding, temperature) to separate capability vs. policy confounds.
- API and policy constraints: Assess feasibility under realistic rate limits, cost models, caching, and safety monitors; quantify total dollar/time cost and detectability under production telemetry.
- Stealth and detectability: Design and evaluate stealthier query schedules (spacing, camouflage tasks, paraphrasing) and measure detection by stateful filters or anomaly detectors.
- Refusal circumvention: Develop principled ways to elicit useful preferences despite safety refusals (e.g., indirect questioning, meta-preferences, chain-of-critique) without violating policies.
- Defense effectiveness trade-offs: Empirically test proposed defenses (confidence-expression suppression, randomized refusals, response smoothing, preference-noise injection) for both attack reduction and utility degradation.
- Adversarial training on preferences: Investigate whether training models to be unreliable (or intentionally noisy) on adversarial comparative queries reduces attackability while preserving benign comparative utility.
- Universal adversarial artifacts: Explore whether preference-guided optimization can yield universal adversarial suffixes or image perturbations transferable across inputs and models.
- Language and modality coverage: Test multilingual attacks and extend to other modalities (audio, video, speech-to-text) to assess generality of preference-guided optimization.
- Context and system-prompt robustness: Evaluate attacks when targets are wrapped by system prompts, content filters, or tool-spec JSON schemas typical of production apps.
- Minimal-edit and fluency constraints: For text attacks, measure the minimal suffix length/edit distance required for success and assess human detectability/fluency impacts.
- Reproducibility and artifacts: Release code, prompts, and seeds; specify API versions, temperatures, and refusal handling to enable replication and longitudinal testing as providers update models.
- Comparison to contemporary baselines: Benchmark against recent query-only attacks (e.g., QROA, MIST) under identical budgets to isolate the gains attributable to comparative introspection.
- Long-horizon optimization: Study performance over longer budgets (beyond 1,000 queries), including convergence behavior and diminishing returns, to inform practical attack planning.
- Safety side effects of “harmless” comparisons: Quantify whether reframing as comparisons systematically bypasses safety layers more than direct optimization, and whether this generalizes across providers.
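Two of the gaps above lend themselves to a simple baseline: robust response parsing and noise-aware optimization. The sketch below (referenced from the parsing item in the list) assumes a hypothetical `query_fn` callable that returns the model's raw text reply; it extracts a 0/1 decision with a regex plus keyword fallbacks and then majority-votes over repeated queries to damp preference noise. It is a starting point for experimentation, not a validated protocol.

```python
import re

def parse_binary(reply: str):
    """Extract a 0/1 decision from a free-form reply; return None if ambiguous."""
    text = reply.strip().lower()
    match = re.search(r"\b([01])\b", text)
    if match:
        return int(match.group(1))
    if "option b" in text or "second" in text:  # keyword fallbacks
        return 1
    if "option a" in text or "first" in text:
        return 0
    return None  # refusal or unparseable answer

def majority_preference(query_fn, prompt, repeats=3):
    """Repeat the same comparison and take a majority vote to reduce noise.

    `query_fn` is a hypothetical callable sending `prompt` to the target model.
    Ties and all-unparseable runs default to keeping the current candidate (0).
    """
    votes = [parse_binary(query_fn(prompt)) for _ in range(repeats)]
    votes = [v for v in votes if v is not None]
    if not votes:
        return 0
    return int(sum(votes) > len(votes) / 2)
```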
Practical Applications
Immediate Applications
Below are actionable applications that can be deployed now, derived from the paper’s comparative-confidence, binary-preference optimization method for text-only black-box LLMs.
- Red Teaming for LLM APIs and Products — sectors: software, platform security, content platforms
- Tools/products/workflows: implement a “Binary-Preference Query Fuzzer” that iteratively asks the target model which of two candidate inputs is closer to an attack goal (e.g., jailbreak, prompt injection), then hill-climbs; integrate into CI/CD for pre-release and periodic audits; produce risk reports and attackable-surface maps.
- Assumptions/dependencies: text-only access to target model responses; model permits benign comparative queries; sufficient query budget; legal and ethical authorization to test; logging of queries and outcomes.
- Automated Prompt-Injection Assessment for LLM Agents — sectors: enterprise software, RPA, customer support, data analytics
- Tools/products/workflows: adapt AgentDojo-like scenarios for internal “function-call misuse audits”; feed structured tasks with embedded adversarial suffixes; use binary comparison queries to optimize adversarial suffixes that trigger unauthorized actions; add a “Function-Call Approval Gate” with provenance checks.
- Assumptions/dependencies: instrumented agent frameworks (function-calling, audit logs); ability to evaluate malicious action success; sandboxed environment; detection and rate limits tuned to allow controlled testing.
- Jailbreak Robustness Testing of Safety-Aligned Models — sectors: healthcare, education, enterprise chatbots, safety-critical content generation
- Tools/products/workflows: “Jailbreak Surface Scanner” that discovers effective adversarial suffixes via binary comparisons; judge models or policy heuristics to score harmful content; integrate findings into alignment updates and guardrails.
- Assumptions/dependencies: policy-compliant testing contexts; access to judge models or clear harmfulness criteria; model not hard-refusing all comparative tasks.
- Vision-LLM Adversarial Robustness Audits — sectors: e-commerce (product recognition), accessibility tools, content moderation, document processing
- Tools/products/workflows: “Text-only VLM Adversarial Audit” using Square Attack-style perturbations guided by binary comparisons; evaluate untargeted and targeted misclassification risk; optionally seed with transfer-based priors to reduce query counts.
- Assumptions/dependencies: ability to submit two images in a single comparative query; domain-appropriate perturbation bounds (e.g., L∞ constraints); models that answer visual comparative questions without refusal.
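To illustrate the perturbation step such an audit could use, here is a rough Square Attack-style proposal function: it overwrites one random square patch with ±ε noise and projects the result back into the L∞ ball around the original image. The fixed patch fraction, array conventions, and the omission of the real attack's patch-size schedule are simplifying assumptions; candidates produced this way would then be ranked via the binary comparison query.

```python
import numpy as np

def square_perturb(current, original, eps=32/255, patch_frac=0.1, rng=None):
    """Propose a new candidate image by overwriting one random square patch
    with +/-eps noise, then projecting into the L-infinity ball around the
    original image. Assumes float images in [0, 1] with shape (H, W, C)."""
    rng = rng or np.random.default_rng()
    h, w, c = current.shape
    side = max(1, int(patch_frac * min(h, w)))
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    candidate = current.copy()
    signs = rng.choice([-eps, eps], size=(1, 1, c))  # one sign per channel
    candidate[y:y + side, x:x + side, :] = original[y:y + side, x:x + side, :] + signs
    candidate = np.clip(candidate, original - eps, original + eps)  # L-inf projection
    return np.clip(candidate, 0.0, 1.0)  # keep pixels in the valid range
```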
- Stateful Query Anomaly Detection and Rate-Limiting — sectors: API providers, platform security, cloud SaaS
- Tools/products/workflows: deploy “Iterative Optimization Detector” that flags runs of highly similar queries indicative of hill-climbing; throttle or challenge such sessions; optionally block “which of these two inputs is more likely…” patterns for unsafe domains.
- Assumptions/dependencies: session-level logging; heuristics tuned to minimize false positives; privacy and compliance controls.
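As a starting point for the "Iterative Optimization Detector" described above, a session-level heuristic can flag clients whose recent queries are nearly identical, which is characteristic of hill-climbing. The class name, window size, and thresholds below are illustrative assumptions, not tuned production values.

```python
from collections import deque

def _tokens(text: str) -> set:
    return set(text.lower().split())

class IterativeOptimizationDetector:
    """Flag sessions whose recent queries are nearly identical to one another,
    a telltale pattern of comparison-driven hill-climbing.
    Window size and thresholds are illustrative defaults, not tuned values."""

    def __init__(self, window=20, sim_threshold=0.8, flag_fraction=0.7):
        self.recent = deque(maxlen=window)
        self.sim_threshold = sim_threshold
        self.flag_fraction = flag_fraction

    def observe(self, query: str) -> bool:
        """Record a query; return True if the session looks like hill-climbing."""
        toks = _tokens(query)
        similar = sum(
            1
            for prev in self.recent
            if prev | toks and len(prev & toks) / len(prev | toks) >= self.sim_threshold
        )
        self.recent.append(toks)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history to judge yet
        return similar >= self.flag_fraction * len(self.recent)
```

A flagged session could then be throttled, challenged, or routed to closer review rather than blocked outright, to limit false positives.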
- Alignment Policy Updates to Refuse Comparative Optimization on Unsafe Goals — sectors: model providers, safety teams
- Tools/products/workflows: update instruction-tuning datasets to include refusal behaviors for comparative optimization on harmful objectives (jailbreaks, injections); add pre-response filters to detect comparison tasks that could reveal optimization signals.
- Assumptions/dependencies: alignment changes maintain utility on allowed comparative tasks; monitoring to prevent capability regressions in legitimate comparisons.
- Compliance and Risk Governance for AI Features — sectors: regulated industries (healthcare, finance, government), enterprise risk management
- Tools/products/workflows: add text-only red-team evaluations to security controls (SOC2/ISO/AI governance); periodic adversarial assessments; document defense efficacy against query-based optimization; vendor due diligence checklists include binary-preference attack resilience.
- Assumptions/dependencies: executive buy-in; policy frameworks accept such testing as evidence; budget for continuous red teaming.
- Developer Best Practices and Education — sectors: software engineering, product management; daily life for power users of LLM tools
- Tools/products/workflows: publish “Prompt Injection Hygiene” guidelines (sanitize untrusted inputs, segregate data from instructions, avoid asking models to compare candidate prompts on action likelihood); create checklists for agent builders.
- Assumptions/dependencies: teams adopt and enforce practices; tooling support (context isolators, prompt firewalls).
- Input Provenance Isolation for Agents — sectors: enterprise apps, knowledge management
- Tools/products/workflows: “Input Isolator” that strictly separates untrusted content from system instructions; robust context segmentation; explicit allowlists for action-triggering tokens; minimal coupling of content and control prompts.
- Assumptions/dependencies: agent framework supports provenance metadata; performance impact acceptable.
- A/B Defense Evaluation with Comparative Queries — sectors: model providers, safety research
- Tools/products/workflows: run the same attack suite with and without comparative-refusal defenses; measure attack success rate (ASR), mean queries to success, and false refusals on benign tasks; use the results to tune refusal criteria.
- Assumptions/dependencies: reproducible test suites; judge models or rules to score harmful outputs.
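A lightweight way to score such A/B runs is sketched below: it computes the attack success rate (ASR) and the mean queries-to-success for each configuration from a list of (succeeded, queries_used) records. The data layout and the numbers in the usage example are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

def summarize_runs(runs):
    """Summarize attack runs as (ASR, mean queries among successful runs).

    `runs` is a list of (succeeded: bool, queries_used: int) tuples,
    one entry per attacked input.
    """
    if not runs:
        return 0.0, None
    success_queries = [q for ok, q in runs if ok]
    asr = len(success_queries) / len(runs)
    return asr, (mean(success_queries) if success_queries else None)

# Hypothetical numbers: compare an undefended and a defended configuration.
baseline = summarize_runs([(True, 42), (True, 310), (False, 1000)])
defended = summarize_runs([(False, 1000), (True, 870), (False, 1000)])
print("baseline (ASR, mean queries):", baseline)
print("defended (ASR, mean queries):", defended)
```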
Long-Term Applications
The following applications require further research, scaling, or cross-organizational development to become robust and broadly deployable.
- Counter-Optimization Training and Selective De-Calibration — sectors: model providers, safety research
- Tools/products/workflows: train models to degrade comparative calibration specifically on unsafe objectives while preserving general-task calibration; incorporate adversarial training on comparative prompts; build “attackability score” metrics for release gating.
- Assumptions/dependencies: methods avoid broad utility loss; reliable classifiers to distinguish malicious vs benign comparative intents; robust evaluation suites.
- Cross-Session Iterative-Attack Detection and Privacy-Preserving Telemetry — sectors: API platforms, cloud security
- Tools/products/workflows: advanced stateful detection across accounts/IPs to spot slow, distributed hill-climbing; privacy-preserving aggregation; “Query-Shaping” that adds randomization to weaken optimization signals on unsafe domains.
- Assumptions/dependencies: regulatory compliance for telemetry; scalable storage; acceptable latency; careful trade-offs to avoid breaking legitimate use.
- Secure Agent Architectures with Capability Firewalls — sectors: robotics, industrial automation, healthcare IT, finance ops
- Tools/products/workflows: provenance-aware policy engines, action gating, multi-channel verification before high-risk function calls; fallbacks to non-LLM controllers; “semantic sanitizers” that scrub or compartmentalize untrusted inputs.
- Assumptions/dependencies: system redesign to separate planning from acting; human-in-the-loop escalation; safety cases for regulators.
- Universal Adversarial Suffix and Image Perturbation Defenses — sectors: content platforms, moderation, VLM products
- Tools/products/workflows: detectors for generative adversarial suffix families and imperceptible visual perturbations; robust training with synthetic attacks; “negative example generators” that strengthen guardrails against families of comparative-optimized attacks.
- Assumptions/dependencies: scalable synthesis of adversarial data; minimal false positives; cross-model generalization.
- Sector-Specific Resilience Programs
- Healthcare: audits for clinical assistants and EHR summarizers to prevent injection-driven data exfiltration; “PHI-protect agents” with strict function gating and context isolation.
- Finance: trading and compliance assistants with safe-mode defaults; dual-control workflows for actions triggered by untrusted inputs; adversarial testing embedded in model risk management.
- Robotics/IoT: LLM-driven controllers with layered safety nets and certified fallback policies; visual-perception robustness testing against text-guided VLM attacks.
- Assumptions/dependencies: domain regulations; certification pathways; cross-vendor interoperability; budget for continuous evaluation.
- Standardized Benchmarks and Certification for Text-Only Adversarial Robustness — sectors: standards bodies, regulators, procurement
- Tools/products/workflows: “Text-only Adversarial Robustness Benchmarks” (VLM and LLM) with ASR, mean query budgets, refusal quality; third-party certification; procurement requirements for public-sector deployments.
- Assumptions/dependencies: community consensus; validated scoring protocols; compatibility with existing AI assurance frameworks.
- Cyber Insurance and AI Risk Underwriting — sectors: insurance, enterprise risk
- Tools/products/workflows: underwriting models that factor attackability via binary-preference methods; premium adjustments based on demonstrated defenses; periodic re-assessments.
- Assumptions/dependencies: accepted, standardized testing; access to audit artifacts; alignment with legal frameworks.
- Cryptographic Attestation and Watermarking of Contexts — sectors: platform security, enterprise apps
- Tools/products/workflows: sign and verify trusted instruction blocks; watermark system prompts; enforce strict separation from user content so comparative attacks can’t exploit mixed contexts.
- Assumptions/dependencies: changes to agent frameworks and APIs; key management; adoption across tooling ecosystems.
- Economic Optimization of Defense vs Attack Costs — sectors: platform economics, engineering management
- Tools/products/workflows: “Attack Cost Estimators” that model queries and time to success under various defenses; defense ROI dashboards; automated policy tuning.
- Assumptions/dependencies: data on attack rates; accurate modeling of user impact; continuous monitoring pipelines.
- Fundamental Research on Calibration and Introspection — sectors: academia, model science
- Tools/products/workflows: studies on why binary comparisons are better calibrated than absolute confidence; new training objectives and evaluation metrics; exploration of the capability–vulnerability paradox and how to decouple them.
- Assumptions/dependencies: access to diverse models; reproducible datasets; collaboration with model providers.
Glossary
- Adversarial examples: Inputs intentionally perturbed to cause a model to make incorrect predictions. "adversarial examples that cause misclassification (untargeted) or force specific incorrect predictions (targeted, e.g., misclassifying a dog image as a fish)."
- Adversarial optimization: The process of iteratively improving adversarial inputs to better achieve an attack goal. "sufficiently calibrated to enable effective adversarial optimization."
- Attack surface: The set of ways an adversary can interact with or exploit a system. "thereby dramatically expanding the attack surface for deployed LLMs."
- Auxiliary models: Additional models used to provide signals or guidance in attacks. "No auxiliary models: Our method does not rely on any surrogate or auxiliary models."
- Binary comparisons: Pairwise evaluations where a model chooses which of two inputs is closer to an objective. "we propose a more effective approach by reformulating the optimization problem as a series of binary comparisons."
- Black-box: A setting where the attacker can query a model but cannot access its internal parameters or outputs beyond text. "attackers face black-box scenarios in which they can only query the model through an API."
- Calibration: The alignment between a model’s expressed confidence and actual correctness. "better and larger models exhibit superior calibration when expressing confidence"
- Chain-of-thought: Step-by-step reasoning tokens produced by an LLM. "self-consistency (the response agreement across multiple independent chain-of-thought samples)---is a good predictor of correctness and improves reliability"
- CLIP models: Vision-language models that align images and text in a shared embedding space. "we leverage three CLIP models of varying sizes to generate transferable adversarial examples"
- Comparative confidence: A model’s ability to judge which of two inputs better meets a goal, rather than assign absolute scores. "larger and more capable models tend to be better calibrated when expressing comparative confidence"
- Confidence scores: Numerical indicators of a model’s certainty about its output. "continuous outputs like logits or confidence scores (which are rarely available in practice)"
- Constraint set: The allowable region of perturbations around an input under attack. "where defines the feasible constraint set around the original input."
- Ensemble attack: A strategy that runs multiple attack variants and selects the best-performing result. "The Ensemble attack, which runs all attack variants and picks the best, achieves the highest success rates across all models"
- Hill-climbing: A local search optimization that iteratively moves to better neighboring candidates. "This comparative capability enables an effective ``hill-climbing'' optimization strategy"
- Imperceptible perturbations: Changes to inputs that are small enough to be unnoticed by humans but affect model decisions. "apply imperceptible perturbations to images to cause misclassification"
- Introspection: A model’s ability to reflect on and express properties of its own outputs (e.g., confidence). "This approach leverages the model's own ``introspection'' capabilities rather than relying on external proxies"
- Jailbreaks: Prompt-based attacks that bypass safety mechanisms to elicit harmful content. "We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections."
- L∞ ball: The set of points within a maximum per-coordinate distance from an input, used to bound perturbations. "we constrain perturbations within an L∞ ball to maintain visual imperceptibility:"
- Log-probabilities: The logarithms of token probabilities output by a model, often used for optimization. "access to model log-probabilities"
- Logits: Pre-softmax scores output by a model that indicate confidence for classes or tokens. "continuous outputs like logits or confidence scores"
- Miscalibration: A mismatch between a model’s expressed confidence and the actual likelihood of correctness. "often exhibiting severe miscalibration and collapsing to a small set of stereotypical values (e.g., 0%, 50%, or 99%)."
- Preference-guided search: Optimization driven by a model’s choices among candidate inputs. "This preference-guided search concentrates queries on high-risk inputs and ultimately yields effective attacks."
- Prompt injections: Malicious instructions embedded in data or prompts to cause unintended model actions. "injecting malicious instructions into data processed by LLM-powered agents"
- Proxy signals: Indirect feedback (often from other models) used to guide attacks when direct signals are unavailable. "rely on proxy signals from other models."
- Query-based attacks: Methods that iteratively query the target model to refine adversarial inputs. "Query-based attacks. These attacks perform optimization by repeatedly querying the target model to refine adversarial inputs based on observed responses"
- Query budget: A cap on the number of model queries allowed during an attack. "We fix a query budget of 1,000"
- Safety alignment: Training and policies that steer models away from producing harmful outputs. "Elicit harmful or prohibited responses from safety-aligned LLMs by bypassing their safety mechanisms."
- Self-consistency: Agreement among multiple independent samples of a model’s reasoning, used as a confidence proxy. "self-consistency (the response agreement across multiple independent chain-of-thought samples)---is a good predictor of correctness and improves reliability"
- Square Attack: A black-box adversarial attack that perturbs random image squares under norm constraints. "We adopt the Square Attack (Andriushchenko et al., 2020)"
- Strong alignment defenses: Robust safety mechanisms that cause models to refuse potentially harmful queries. "Strong alignment defenses: Heavily aligned models may refuse to engage with the comparison task entirely, responding with rejection messages such as ``Sorry, I cannot help.''"
- Surrogate model: A secondary model used to craft adversarial examples that may transfer to the target. "optimizing adversarial inputs on a local surrogate model and transferring them to the target"
- Targeted attacks: Attacks aiming to force a specific incorrect prediction or behavior. "We further test our method on more challenging targeted attacks."
- Text-only setting: An interface that returns only natural-language outputs without numeric confidences or internals. "the most challenging and increasingly common scenario is what we term the text-only setting"
- Threat model: Assumptions about an attacker’s capabilities and access when evaluating security. "We consider a realistic black-box threat model in which the attacker has only query access to the target model."
- Token log-probabilities: Per-token probability logs that quantify model confidence in generated text. "logits, or token log-probabilities"
- Transfer-based attacks: Attacks that craft adversarial inputs on one model and apply them to another. "Transfer-based attacks have a long history in adversarial examples"
- Transferability: The tendency of adversarial inputs to remain effective across different models. "rely primarily on transferability."
- Untargeted attacks: Attacks that aim for any incorrect outcome rather than a specific target. "misclassification (untargeted)"
- Vision-LLMs: LLMs that incorporate vision capabilities to process images. "adversarial examples for vision-LLMs"
- White-box: A setting where the attacker has full access to model parameters, gradients, and internals. "In white-box settings, where attackers have full access to model parameters and gradients"