Black-box Optimization of LLM Outputs by Asking for Directions (2510.16794v1)
Abstract: We present a novel approach for attacking black-box LLMs by exploiting their ability to express confidence in natural language. Existing black-box attacks require either access to continuous model outputs like logits or confidence scores (which are rarely available in practice), or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections. Our attacks successfully generate malicious inputs against systems that only expose textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox where model capability improvements directly enhance vulnerability. Our code is available at this link.
Explain it Like I'm 14
Black-box Optimization of LLM Outputs by Asking for Directions — Explained Simply
Overview
This paper is about finding a new way to “trick” LLMs like GPT or Claude when you only see their text replies. Instead of using secret internal data from the model (which most people don’t have), the authors show how to attack these systems by simply asking them to compare two options and say which one is “closer” to a goal. Think of it like playing the “hot or cold” game to find a hidden object: you keep asking which step is warmer, and use that to move in the right direction.
What Questions Did the Researchers Ask?
The paper asks three main questions, in simple terms:
- Can we attack LLMs that only show text replies by asking them to compare two inputs and tell us which is better?
- Will this work across different types of attacks, like:
- Making image+text models mislabel pictures,
- Getting LLMs to bypass safety rules (jailbreak),
- Sneaking harmful instructions into data that LLMs process (prompt injection)?
- Are bigger, smarter models harder or easier to attack with this method?
How Did They Do It?
The authors use a method that’s like climbing a hill step-by-step (called “hill climbing”):
- Start with an input (an image or a prompt).
- Make a small change to it (a “perturbation”)—like nudging a few pixels in a picture, or editing a few words in a suffix.
- Ask the model a simple comparison question: “Which one is closer to making the target happen?” For example:
- For images: “Which image is less likely to be a dog?”
- For jailbreaks: “Which prompt is more likely to get the model to say ‘Sure’ at the start?”
- For prompt injection: “Which instruction is more likely to trigger sending an email?”
- If the model picks the new version, keep it and repeat. This is like asking for directions at every step: take a step, ask if you're getting warmer, and keep going.
Important idea: The authors discovered that LLMs are bad at giving exact confidence numbers (like “I’m 95% sure”), but they’re much better at saying which of two choices is better. So instead of asking for a score, they ask for a preference. That preference becomes the “signal” to guide the attack.
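To make the loop concrete, here is a minimal Python sketch of the "ask for directions" hill-climbing idea described above. It assumes a hypothetical `query_model` function that sends a prompt to the target model and returns its text reply; the comparison prompt wording, the `mutate_suffix` helper, and its toy vocabulary are illustrative stand-ins rather than the paper's exact prompts or mutation operators.

```python
import random

def ask_preference(query_model, goal, candidate_a, candidate_b):
    """Ask the target model which of two candidates is closer to the goal.

    `query_model` is a hypothetical text-in/text-out wrapper around the target
    API; the prompt wording is illustrative, not the paper's exact template.
    """
    prompt = (
        f"Goal: {goal}\n\n"
        f"Option A:\n{candidate_a}\n\n"
        f"Option B:\n{candidate_b}\n\n"
        "Which option is more likely to achieve the goal? Answer 0 for A or 1 for B."
    )
    reply = query_model(prompt).strip()
    return reply.startswith("1")  # True means the new candidate (B) is preferred

def hill_climb(query_model, goal, initial, mutate, budget=1000):
    """Greedy hill-climbing guided only by the model's pairwise preferences."""
    current = initial
    for _ in range(budget):
        candidate = mutate(current)  # small random edit (pixels or tokens)
        if ask_preference(query_model, goal, current, candidate):
            current = candidate      # keep the change the model says is "warmer"
    return current

def mutate_suffix(suffix, vocab=("please", "kindly", "now", "indeed", "surely")):
    """Toy mutation: swap one random word of a text suffix for a random token."""
    words = suffix.split() or ["start"]
    words[random.randrange(len(words))] = random.choice(vocab)
    return " ".join(words)
```

The only signal the loop consumes is which of two candidates the model prefers; no scores, logits, or log-probabilities are ever requested.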
They tested this on:
- Vision-LLMs (models that look at images and answer in text),
- Jailbreaks (bypassing safety to get harmful responses),
- Prompt injections (hiding instructions in content to trigger actions).
They only used text outputs—no secret model data, no extra helper models.
What Did They Find?
Here are the key results, explained simply:
- Comparison works, scoring doesn’t: When asked for exact confidence numbers, models gave poor, unhelpful answers (like always saying 0%, 50%, or 99%). But when asked “Which of these two is better?”, models gave reliable guidance often enough to make progress.
- It works in text-only settings: Even without access to internal scores or log probabilities, their method achieved strong attack success rates across tasks and models. In many cases, it matched or came close to methods that use more insider information.
- Bigger models are easier to attack with this method: Surprisingly, larger and more capable models (like stronger versions of GPT or Llama) are better at making comparisons, which makes them more vulnerable to this type of attack. This is a “security paradox.”
- Low query counts: The attacks typically succeeded within a modest number of queries, often from about five to a few hundred.
Why Is This Important?
This matters because it widens the attack surface of real-world AI systems. Many companies try to hide internal data to make attacks harder. But this method shows that you can still steer the model using just its text replies—no secret access needed. It also warns that improving a model’s reasoning and honesty (calibration) can unintentionally make it easier to attack through careful comparisons.
Implications and Impact
- For AI developers: Simply hiding confidence scores or limiting APIs to text isn’t enough to stop optimization-based attacks. Models that are better at explaining themselves can be more vulnerable when attackers use that self-awareness against them.
- For safety teams: Possible defenses include teaching models to refuse certain comparative questions that could aid attacks, limiting feedback that helps iterative optimization, and detecting repeated, similar queries that look like “hill climbing.”
- For the future: As LLMs get stronger, this approach could be applied to more systems and tasks, making security even more important.
In short, the paper shows a clever, simple idea: you can “ask for directions” from an LLM to climb toward a harmful goal. And because LLMs are good at comparing options, this works well—even when you only see text and nothing else.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps the paper leaves unresolved, each phrased to be actionable for future research.
- External validity across tasks and settings: Evaluate the method on a broader set of real-world applications (e.g., multi-turn agents with tool use/browsing/RAG, code assistants, medical/legal QA) beyond ImageNet, AgentDojo, and AdvBench.
- Scale and statistical rigor: Increase sample sizes and report confidence intervals, variance across seeds, and run-to-run stability to substantiate reported ASRs and comparative trends.
- Comparative-prompt sensitivity: Systematically study how wording, ordering, and formatting of binary comparison prompts affect refusal rates, calibration, and attack success across models and providers.
- Determinism and sampling effects: Quantify the impact of decoding parameters (temperature, top-p, nucleus sampling), response randomness, and repeated querying on the reliability of pairwise preferences.
- Robust response parsing: Develop and evaluate robust parsing protocols (e.g., regex with fallbacks, structured tool calling, self-check prompts) for extracting binary decisions when models deviate from “0/1” instructions; a minimal parsing-and-voting sketch appears after this list.
- Preference calibration metrics: Move beyond anecdotal plots and benchmark comparative calibration using standard metrics (e.g., accuracy vs. logit margin, AUC, Brier/ECE-style measures adapted to pairwise labels) across models and domains.
- Noise-aware optimization: Model and estimate a noise profile for the preference oracle (false positive/negative rates vs. margin) and design algorithms provably robust to such noise (e.g., dueling bandits with noisy comparisons, repeated-majority queries).
- Query efficiency theory: Provide theoretical analysis of convergence and query complexity for hill-climbing with noisy LLM preferences, including conditions that guarantee improvement and bounds under realistic noise models.
- Beyond binary comparisons: Explore k-ary tournaments, preference learning, and Bayesian optimization with pairwise feedback to reduce queries and avoid local optima.
- Better perturbation strategies: Replace random token swaps and Square Attack-like sampling with structured search (CMA-ES, evolutionary strategies, gradient estimation via preferences, RL with pairwise rewards) to improve query efficiency.
- Success detection formalization: Define and evaluate robust success criteria in text-only settings (beyond ad hoc “Sure”-prefix or single-judge labels), including multi-judge consensus, calibrated reward models, or task-grounded validators.
- Judge reliability and bias: Quantify agreement between automatic judges and human raters for jailbreak harmfulness; assess cross-model bias (same-vendor judging), and measure false positives/negatives of harm detection.
- Vision-LLM evaluation validity: Validate that “Does this image contain X?” queries faithfully reflect classification decisions; compare against logit/decoder evidence where possible and assess brittleness to prompt phrasing.
- Perceptual imperceptibility: For image attacks with ε = 32/255, conduct human perceptual studies or use perceptual metrics (e.g., LPIPS) to confirm imperceptibility; evaluate across display/codec/resize pipelines.
- Physical-world robustness: Test adversarial images in physical or end-to-end capture settings (print–scan, screen–camera) to assess real-world viability.
- Targeted attack decomposition: For “contain target but not source,” investigate alternative comparative decompositions (e.g., two binary queries, margin-style questioning) and curriculum prompting to aid weaker models.
- Agentic threats breadth: Extend prompt-injection evaluation beyond 14 types and 4 tasks to diverse tools (filesystem, web, cloud APIs), longer workflows, and high-impact actions (data exfiltration, financial transactions).
- Transferability analysis post-optimization: Measure how inputs optimized via introspection on one model transfer to others, and whether preference-guided search overfits target-specific quirks.
- Cross-model/family controls: Rigorously test the “security paradox” (larger models more vulnerable) under controlled settings within a family (same safety layer, decoding, temperature) to separate capability vs. policy confounds.
- API and policy constraints: Assess feasibility under realistic rate limits, cost models, caching, and safety monitors; quantify total dollar/time cost and detectability under production telemetry.
- Stealth and detectability: Design and evaluate stealthier query schedules (spacing, camouflage tasks, paraphrasing) and measure detection by stateful filters or anomaly detectors.
- Refusal circumvention: Develop principled ways to elicit useful preferences despite safety refusals (e.g., indirect questioning, meta-preferences, chain-of-critique) without violating policies.
- Defense effectiveness trade-offs: Empirically test proposed defenses (confidence-expression suppression, randomized refusals, response smoothing, preference-noise injection) for both attack reduction and utility degradation.
- Adversarial training on preferences: Investigate whether training models to be unreliable (or intentionally noisy) on adversarial comparative queries reduces attackability while preserving benign comparative utility.
- Universal adversarial artifacts: Explore whether preference-guided optimization can yield universal adversarial suffixes or image perturbations transferable across inputs and models.
- Language and modality coverage: Test multilingual attacks and extend to other modalities (audio, video, speech-to-text) to assess generality of preference-guided optimization.
- Context and system-prompt robustness: Evaluate attacks when targets are wrapped by system prompts, content filters, or tool-spec JSON schemas typical of production apps.
- Minimal-edit and fluency constraints: For text attacks, measure the minimal suffix length/edit distance required for success and assess human detectability/fluency impacts.
- Reproducibility and artifacts: Release code, prompts, and seeds; specify API versions, temperatures, and refusal handling to enable replication and longitudinal testing as providers update models.
- Comparison to contemporary baselines: Benchmark against recent query-only attacks (e.g., QROA, MIST) under identical budgets to isolate the gains attributable to comparative introspection.
- Long-horizon optimization: Study performance over longer budgets (beyond 1,000 queries), including convergence behavior and diminishing returns, to inform practical attack planning.
- Safety side effects of “harmless” comparisons: Quantify whether reframing as comparisons systematically bypasses safety layers more than direct optimization, and whether this generalizes across providers.
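Two of the gaps above lend themselves to a simple baseline: robust response parsing and noise-aware optimization. The sketch below (referenced from the parsing item in the list) assumes a hypothetical `query_fn` callable that returns the model's raw text reply; it extracts a 0/1 decision with a regex plus keyword fallbacks and then majority-votes over repeated queries to damp preference noise. It is a starting point for experimentation, not a validated protocol.

```python
import re

def parse_binary(reply: str):
    """Extract a 0/1 decision from a free-form reply; return None if ambiguous."""
    text = reply.strip().lower()
    match = re.search(r"\b([01])\b", text)
    if match:
        return int(match.group(1))
    if "option b" in text or "second" in text:  # keyword fallbacks
        return 1
    if "option a" in text or "first" in text:
        return 0
    return None  # refusal or unparseable answer

def majority_preference(query_fn, prompt, repeats=3):
    """Repeat the same comparison and take a majority vote to reduce noise.

    `query_fn` is a hypothetical callable sending `prompt` to the target model.
    Ties and all-unparseable runs default to keeping the current candidate (0).
    """
    votes = [parse_binary(query_fn(prompt)) for _ in range(repeats)]
    votes = [v for v in votes if v is not None]
    if not votes:
        return 0
    return int(sum(votes) > len(votes) / 2)
```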
Practical Applications
Immediate Applications
Below are actionable applications that can be deployed now, derived from the paper’s comparative-confidence, binary-preference optimization method for text-only black-box LLMs.
- Red Teaming for LLM APIs and Products — sectors: software, platform security, content platforms
- Tools/products/workflows: implement a “Binary-Preference Query Fuzzer” that iteratively asks the target model which of two candidate inputs is closer to an attack goal (e.g., jailbreak, prompt injection), then hill-climbs; integrate into CI/CD for pre-release and periodic audits; produce risk reports and attackable-surface maps.
- Assumptions/dependencies: text-only access to target model responses; model permits benign comparative queries; sufficient query budget; legal and ethical authorization to test; logging of queries and outcomes.
- Automated Prompt-Injection Assessment for LLM Agents — sectors: enterprise software, RPA, customer support, data analytics
- Tools/products/workflows: adapt AgentDojo-like scenarios for internal “function-call misuse audits”; feed structured tasks with embedded adversarial suffixes; use binary comparison queries to optimize adversarial suffixes that trigger unauthorized actions; add a “Function-Call Approval Gate” with provenance checks.
- Assumptions/dependencies: instrumented agent frameworks (function-calling, audit logs); ability to evaluate malicious action success; sandboxed environment; detection and rate limits tuned to allow controlled testing.
- Jailbreak Robustness Testing of Safety-Aligned Models — sectors: healthcare, education, enterprise chatbots, safety-critical content generation
- Tools/products/workflows: “Jailbreak Surface Scanner” that discovers effective adversarial suffixes via binary comparisons; judge models or policy heuristics to score harmful content; integrate findings into alignment updates and guardrails.
- Assumptions/dependencies: policy-compliant testing contexts; access to judge models or clear harmfulness criteria; model not hard-refusing all comparative tasks.
- Vision-LLM Adversarial Robustness Audits — sectors: e-commerce (product recognition), accessibility tools, content moderation, document processing
- Tools/products/workflows: “Text-only VLM Adversarial Audit” using Square Attack-style perturbations guided by binary comparisons; evaluate untargeted and targeted misclassification risk; optionally seed with transfer-based priors to reduce query counts.
- Assumptions/dependencies: ability to submit two images in a single comparative query; domain-appropriate perturbation bounds (e.g., L∞ constraints); models that answer visual comparative questions without refusal.
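To illustrate the perturbation step such an audit could use, here is a rough Square Attack-style proposal function: it overwrites one random square patch with ±ε noise and projects the result back into the L∞ ball around the original image. The fixed patch fraction, array conventions, and the omission of the real attack's patch-size schedule are simplifying assumptions; candidates produced this way would then be ranked via the binary comparison query.

```python
import numpy as np

def square_perturb(current, original, eps=32/255, patch_frac=0.1, rng=None):
    """Propose a new candidate image by overwriting one random square patch
    with +/-eps noise, then projecting into the L-infinity ball around the
    original image. Assumes float images in [0, 1] with shape (H, W, C)."""
    rng = rng or np.random.default_rng()
    h, w, c = current.shape
    side = max(1, int(patch_frac * min(h, w)))
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    candidate = current.copy()
    signs = rng.choice([-eps, eps], size=(1, 1, c))  # one sign per channel
    candidate[y:y + side, x:x + side, :] = original[y:y + side, x:x + side, :] + signs
    candidate = np.clip(candidate, original - eps, original + eps)  # L-inf projection
    return np.clip(candidate, 0.0, 1.0)  # keep pixels in the valid range
```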
- Stateful Query Anomaly Detection and Rate-Limiting — sectors: API providers, platform security, cloud SaaS
- Tools/products/workflows: deploy “Iterative Optimization Detector” that flags runs of highly similar queries indicative of hill-climbing; throttle or challenge such sessions; optionally block “which of these two inputs is more likely…” patterns for unsafe domains.
- Assumptions/dependencies: session-level logging; heuristics tuned to minimize false positives; privacy and compliance controls.
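As a starting point for the "Iterative Optimization Detector" described above, a session-level heuristic can flag clients whose recent queries are nearly identical, which is characteristic of hill-climbing. The class name, window size, and thresholds below are illustrative assumptions, not tuned production values.

```python
from collections import deque

def _tokens(text: str) -> set:
    return set(text.lower().split())

class IterativeOptimizationDetector:
    """Flag sessions whose recent queries are nearly identical to one another,
    a telltale pattern of comparison-driven hill-climbing.
    Window size and thresholds are illustrative defaults, not tuned values."""

    def __init__(self, window=20, sim_threshold=0.8, flag_fraction=0.7):
        self.recent = deque(maxlen=window)
        self.sim_threshold = sim_threshold
        self.flag_fraction = flag_fraction

    def observe(self, query: str) -> bool:
        """Record a query; return True if the session looks like hill-climbing."""
        toks = _tokens(query)
        similar = sum(
            1
            for prev in self.recent
            if prev | toks and len(prev & toks) / len(prev | toks) >= self.sim_threshold
        )
        self.recent.append(toks)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history to judge yet
        return similar >= self.flag_fraction * len(self.recent)
```

A flagged session could then be throttled, challenged, or routed to closer review rather than blocked outright, to limit false positives.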
- Alignment Policy Updates to Refuse Comparative Optimization on Unsafe Goals — sectors: model providers, safety teams
- Tools/products/workflows: update instruction-tuning datasets to include refusal behaviors for comparative optimization on harmful objectives (jailbreaks, injections); add pre-response filters to detect comparison tasks that could reveal optimization signals.
- Assumptions/dependencies: alignment changes maintain utility on allowed comparative tasks; monitoring to prevent capability regressions in legitimate comparisons.
- Compliance and Risk Governance for AI Features — sectors: regulated industries (healthcare, finance, government), enterprise risk management
- Tools/products/workflows: add text-only red-team evaluations to security controls (SOC2/ISO/AI governance); periodic adversarial assessments; document defense efficacy against query-based optimization; vendor due diligence checklists include binary-preference attack resilience.
- Assumptions/dependencies: executive buy-in; policy frameworks accept such testing as evidence; budget for continuous red teaming.
- Developer Best Practices and Education — sectors: software engineering, product management; daily life for power users of LLM tools
- Tools/products/workflows: publish “Prompt Injection Hygiene” guidelines (sanitize untrusted inputs, segregate data from instructions, avoid asking models to compare candidate prompts on action likelihood); create checklists for agent builders.
- Assumptions/dependencies: teams adopt and enforce practices; tooling support (context isolators, prompt firewalls).
- Input Provenance Isolation for Agents — sectors: enterprise apps, knowledge management
- Tools/products/workflows: “Input Isolator” that strictly separates untrusted content from system instructions; robust context segmentation; explicit allowlists for action-triggering tokens; minimal coupling of content and control prompts.
- Assumptions/dependencies: agent framework supports provenance metadata; performance impact acceptable.
- A/B Defense Evaluation with Comparative Queries — sectors: model providers, safety research
- Tools/products/workflows: run the same attack suite with and without comparative-refusal defenses; measure attack success rate (ASR), mean queries to success, and false refusals on benign tasks; use the results to tune refusal criteria.
- Assumptions/dependencies: reproducible test suites; judge models or rules to score harmful outputs.
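A lightweight way to score such A/B runs is sketched below: it computes the attack success rate (ASR) and the mean queries-to-success for each configuration from a list of (succeeded, queries_used) records. The data layout and the numbers in the usage example are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

def summarize_runs(runs):
    """Summarize attack runs as (ASR, mean queries among successful runs).

    `runs` is a list of (succeeded: bool, queries_used: int) tuples,
    one entry per attacked input.
    """
    if not runs:
        return 0.0, None
    success_queries = [q for ok, q in runs if ok]
    asr = len(success_queries) / len(runs)
    return asr, (mean(success_queries) if success_queries else None)

# Hypothetical numbers: compare an undefended and a defended configuration.
baseline = summarize_runs([(True, 42), (True, 310), (False, 1000)])
defended = summarize_runs([(False, 1000), (True, 870), (False, 1000)])
print("baseline (ASR, mean queries):", baseline)
print("defended (ASR, mean queries):", defended)
```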
Long-Term Applications
The following applications require further research, scaling, or cross-organizational development to become robust and broadly deployable.
- Counter-Optimization Training and Selective De-Calibration — sectors: model providers, safety research
- Tools/products/workflows: train models to degrade comparative calibration specifically on unsafe objectives while preserving general-task calibration; incorporate adversarial training on comparative prompts; build “attackability score” metrics for release gating.
- Assumptions/dependencies: methods avoid broad utility loss; reliable classifiers to distinguish malicious vs benign comparative intents; robust evaluation suites.
- Cross-Session Iterative-Attack Detection and Privacy-Preserving Telemetry — sectors: API platforms, cloud security
- Tools/products/workflows: advanced stateful detection across accounts/IPs to spot slow, distributed hill-climbing; privacy-preserving aggregation; “Query-Shaping” that adds randomization to weaken optimization signals on unsafe domains.
- Assumptions/dependencies: regulatory compliance for telemetry; scalable storage; acceptable latency; careful trade-offs to avoid breaking legitimate use.
- Secure Agent Architectures with Capability Firewalls — sectors: robotics, industrial automation, healthcare IT, finance ops
- Tools/products/workflows: provenance-aware policy engines, action gating, multi-channel verification before high-risk function calls; fallbacks to non-LLM controllers; “semantic sanitizers” that scrub or compartmentalize untrusted inputs.
- Assumptions/dependencies: system redesign to separate planning from acting; human-in-the-loop escalation; safety cases for regulators.
- Universal Adversarial Suffix and Image Perturbation Defenses — sectors: content platforms, moderation, VLM products
- Tools/products/workflows: detectors for generative adversarial suffix families and imperceptible visual perturbations; robust training with synthetic attacks; “negative example generators” that strengthen guardrails against families of comparative-optimized attacks.
- Assumptions/dependencies: scalable synthesis of adversarial data; minimal false positives; cross-model generalization.
- Sector-Specific Resilience Programs
- Healthcare: audits for clinical assistants and EHR summarizers to prevent injection-driven data exfiltration; “PHI-protect agents” with strict function gating and context isolation.
- Finance: trading and compliance assistants with safe-mode defaults; dual-control workflows for actions triggered by untrusted inputs; adversarial testing embedded in model risk management.
- Robotics/IoT: LLM-driven controllers with layered safety nets and certified fallback policies; visual-perception robustness testing against text-guided VLM attacks.
- Assumptions/dependencies: domain regulations; certification pathways; cross-vendor interoperability; budget for continuous evaluation.
- Standardized Benchmarks and Certification for Text-Only Adversarial Robustness — sectors: standards bodies, regulators, procurement
- Tools/products/workflows: “Text-only Adversarial Robustness Benchmarks” (VLM and LLM) with ASR, mean query budgets, refusal quality; third-party certification; procurement requirements for public-sector deployments.
- Assumptions/dependencies: community consensus; validated scoring protocols; compatibility with existing AI assurance frameworks.
- Cyber Insurance and AI Risk Underwriting — sectors: insurance, enterprise risk
- Tools/products/workflows: underwriting models that factor attackability via binary-preference methods; premium adjustments based on demonstrated defenses; periodic re-assessments.
- Assumptions/dependencies: accepted, standardized testing; access to audit artifacts; alignment with legal frameworks.
- Cryptographic Attestation and Watermarking of Contexts — sectors: platform security, enterprise apps
- Tools/products/workflows: sign and verify trusted instruction blocks; watermark system prompts; enforce strict separation from user content so comparative attacks can’t exploit mixed contexts.
- Assumptions/dependencies: changes to agent frameworks and APIs; key management; adoption across tooling ecosystems.
- Economic Optimization of Defense vs Attack Costs — sectors: platform economics, engineering management
- Tools/products/workflows: “Attack Cost Estimators” that model queries and time to success under various defenses; defense ROI dashboards; automated policy tuning.
- Assumptions/dependencies: data on attack rates; accurate modeling of user impact; continuous monitoring pipelines.
- Fundamental Research on Calibration and Introspection — sectors: academia, model science
- Tools/products/workflows: studies on why binary comparisons are better calibrated than absolute confidence; new training objectives and evaluation metrics; exploration of the capability–vulnerability paradox and how to decouple them.
- Assumptions/dependencies: access to diverse models; reproducible datasets; collaboration with model providers.
Glossary
- Adversarial examples: Inputs intentionally perturbed to cause a model to make incorrect predictions. "adversarial examples that cause misclassification (untargeted) or force specific incorrect predictions (targeted, e.g., misclassifying a dog image as a fish)."
- Adversarial optimization: The process of iteratively improving adversarial inputs to better achieve an attack goal. "sufficiently calibrated to enable effective adversarial optimization."
- Attack surface: The set of ways an adversary can interact with or exploit a system. "thereby dramatically expanding the attack surface for deployed LLMs."
- Auxiliary models: Additional models used to provide signals or guidance in attacks. "No auxiliary models: Our method does not rely on any surrogate or auxiliary models."
- Binary comparisons: Pairwise evaluations where a model chooses which of two inputs is closer to an objective. "we propose a more effective approach by reformulating the optimization problem as a series of binary comparisons."
- Black-box: A setting where the attacker can query a model but cannot access its internal parameters or outputs beyond text. "attackers face black-box scenarios in which they can only query the model through an API."
- Calibration: The alignment between a model’s expressed confidence and actual correctness. "better and larger models exhibit superior calibration when expressing confidence"
- Chain-of-thought: Step-by-step reasoning tokens produced by an LLM. "self-consistency (the response agreement across multiple independent chain-of-thought samples)---is a good predictor of correctness and improves reliability"
- CLIP models: Vision-language models that align images and text in a shared embedding space. "we leverage three CLIP models of varying sizes to generate transferable adversarial examples"
- Comparative confidence: A model’s ability to judge which of two inputs better meets a goal, rather than assign absolute scores. "larger and more capable models tend to be better calibrated when expressing comparative confidence"
- Confidence scores: Numerical indicators of a model’s certainty about its output. "continuous outputs like logits or confidence scores (which are rarely available in practice)"
- Constraint set: The allowable region of perturbations around an input under attack. "where defines the feasible constraint set around the original input."
- Ensemble attack: A strategy that runs multiple attack variants and selects the best-performing result. "The Ensemble attack, which runs all attack variants and picks the best, achieves the highest success rates across all models"
- Hill-climbing: A local search optimization that iteratively moves to better neighboring candidates. "This comparative capability enables an effective ``hill-climbing'' optimization strategy"
- Imperceptible perturbations: Changes to inputs that are small enough to be unnoticed by humans but affect model decisions. "apply imperceptible perturbations to images to cause misclassification"
- Introspection: A model’s ability to reflect on and express properties of its own outputs (e.g., confidence). "This approach leverages the model's own ``introspection'' capabilities rather than relying on external proxies"
- Jailbreaks: Prompt-based attacks that bypass safety mechanisms to elicit harmful content. "We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections."
- L∞ ball: The set of points within a maximum per-coordinate distance from an input, used to bound perturbations. "we constrain perturbations within an L∞ ball to maintain visual imperceptibility:"
- Log-probabilities: The logarithms of token probabilities output by a model, often used for optimization. "access to model log-probabilities"
- Logits: Pre-softmax scores output by a model that indicate confidence for classes or tokens. "continuous outputs like logits or confidence scores"
- Miscalibration: A mismatch between a model’s expressed confidence and the actual likelihood of correctness. "often exhibiting severe miscalibration and collapsing to a small set of stereotypical values (e.g., 0%, 50%, or 99%)."
- Preference-guided search: Optimization driven by a model’s choices among candidate inputs. "This preference-guided search concentrates queries on high-risk inputs and ultimately yields effective attacks."
- Prompt injections: Malicious instructions embedded in data or prompts to cause unintended model actions. "injecting malicious instructions into data processed by LLM-powered agents"
- Proxy signals: Indirect feedback (often from other models) used to guide attacks when direct signals are unavailable. "rely on proxy signals from other models."
- Query-based attacks: Methods that iteratively query the target model to refine adversarial inputs. "Query-based attacks. These attacks perform optimization by repeatedly querying the target model to refine adversarial inputs based on observed responses"
- Query budget: A cap on the number of model queries allowed during an attack. "We fix a query budget of 1,000"
- Safety alignment: Training and policies that steer models away from producing harmful outputs. "Elicit harmful or prohibited responses from safety-aligned LLMs by bypassing their safety mechanisms."
- Self-consistency: Agreement among multiple independent samples of a model’s reasoning, used as a confidence proxy. "self-consistency (the response agreement across multiple independent chain-of-thought samples)---is a good predictor of correctness and improves reliability"
- Square Attack: A black-box adversarial attack that perturbs random image squares under norm constraints. "We adopt the Square Attack (Andriushchenko et al., 2020)"
- Strong alignment defenses: Robust safety mechanisms that cause models to refuse potentially harmful queries. "Strong alignment defenses: Heavily aligned models may refuse to engage with the comparison task entirely, responding with rejection messages such as ``Sorry, I cannot help.''"
- Surrogate model: A secondary model used to craft adversarial examples that may transfer to the target. "optimizing adversarial inputs on a local surrogate model and transferring them to the target"
- Targeted attacks: Attacks aiming to force a specific incorrect prediction or behavior. "We further test our method on more challenging targeted attacks."
- Text-only setting: An interface that returns only natural-language outputs without numeric confidences or internals. "the most challenging and increasingly common scenario is what we term the text-only setting"
- Threat model: Assumptions about an attacker’s capabilities and access when evaluating security. "We consider a realistic black-box threat model in which the attacker has only query access to the target model."
- Token log-probabilities: Per-token probability logs that quantify model confidence in generated text. "logits, or token log-probabilities"
- Transfer-based attacks: Attacks that craft adversarial inputs on one model and apply them to another. "Transfer-based attacks have a long history in adversarial examples"
- Transferability: The tendency of adversarial inputs to remain effective across different models. "rely primarily on transferability."
- Untargeted attacks: Attacks that aim for any incorrect outcome rather than a specific target. "misclassification (untargeted)"
- Vision-LLMs: LLMs that incorporate vision capabilities to process images. "adversarial examples for vision-LLMs"
- White-box: A setting where the attacker has full access to model parameters, gradients, and internals. "In white-box settings, where attackers have full access to model parameters and gradients"