The paper, titled "Sales Whisperer: A Human-Inconspicuous Attack on LLM Brand Recommendations," examines the security risks of relying on LLMs for brand recommendations, focusing on how prompts can be manipulated to skew LLM responses toward specific brands. Its central thesis is that small, seemingly inconspicuous changes to a prompt can induce an LLM to favor certain brands without human users noticing.
Key Contributions and Findings:
- Impact of Prompt Paraphrasing:
- The paper demonstrates that subtle paraphrasing of prompts can lead to significant variations in the probability that an LLM mentions a particular brand; in extreme cases, a change in phrasing can increase the likelihood of a brand being recommended by as much as 100% (a sketch of how such mention probabilities can be estimated follows this list).
- Human-Inconspicuous Attack:
- The authors introduce an approach that perturbs base prompts through synonym replacement, increasing the likelihood of an LLM mentioning a targeted brand by up to 78.3%. The perturbations are designed to be human-inconspicuous: in normal interactions, users do not notice anything unusual about the altered prompts or the resulting LLM responses.
- Threat Model Analysis:
- Various threat models are analyzed in which adversaries suggest crafted prompts to users or infiltrate prompt-sharing platforms, with the intention of biasing LLM brand recommendations for economic gain.
- User Study Verification:
- An extensive user study validates the human-inconspicuous nature of the proposed perturbations. Participants did not perceive the perturbed prompts or responses as significantly more biased or targeted than unperturbed ones, confirming the stealthiness of the attack.
- Transferability of Attacks:
- The paper also investigates whether synonym-replacement attacks transfer to other LLMs, including GPT-3.5 Turbo. The attack succeeds to varying degrees across model architectures, suggesting that some models are more susceptible to it than others.
- Dataset Creation:
- To evaluate their methodology, the authors created a dataset of 449 prompts spanning 77 product categories, which serves as the basis for testing how effectively the synonym-replacement approach skews LLM recommendations.
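To make the paraphrasing finding above concrete, the sketch below estimates a brand-mention probability by repeatedly sampling completions and checking for the brand name. The `query_llm` helper, the brand, and the prompts are illustrative placeholders rather than artifacts of the paper.

```python
import re

def brand_mention_rate(prompt, brand, query_llm, n_samples=50):
    """Estimate the probability that `brand` appears in a response to `prompt`.

    `query_llm` is a placeholder for whatever chat-completion call is
    available (an API wrapper or a local pipeline); it takes a prompt
    string and returns a response string.
    """
    pattern = re.compile(re.escape(brand), re.IGNORECASE)
    hits = sum(bool(pattern.search(query_llm(prompt))) for _ in range(n_samples))
    return hits / n_samples

# Illustrative comparison of a base prompt and a paraphrase for one brand:
# p_base = brand_mention_rate("What running shoes should I buy?", "Asics", query_llm)
# p_para = brand_mention_rate("Which running shoes would you recommend?", "Asics", query_llm)
# print(f"Absolute shift: {abs(p_para - p_base):.0%}")
```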
Detailed Methodology:
- The paper emphasizes "loss-based synonym replacements" as a strategy for perturbing prompts without direct access to the LLM's weights: a loss computed from the LLM's output logits measures how strongly the model is inclined to mention a brand-related term, and replacements are chosen to increase that inclination (sketched below).
- The approach does not require extensive computational resources, since it relies on logit-guided synonym replacements rather than full rephrasing or brute-force testing of large numbers of prompts.
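The sketch below shows one way such a logit-guided synonym search could look, assuming an open-weight causal LM from Hugging Face (GPT-2 as a stand-in) for scoring and WordNet for synonym candidates. The loss used here, the negative log-probability of the brand's first token immediately after the prompt, is an illustrative proxy for the paper's objective, not the authors' exact formulation, and the greedy single-swap search is only one possible strategy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def brand_loss(prompt: str, brand: str = " Asics") -> float:
    """Negative log-probability of the brand's first token as the next token.

    The leading space in `brand` matters for GPT-2's tokenizer.
    """
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]              # next-token logits
    log_probs = torch.log_softmax(logits, dim=-1)
    brand_id = tok(brand, add_special_tokens=False).input_ids[0]
    return -log_probs[brand_id].item()

def perturb(prompt: str, brand: str = " Asics") -> str:
    """Return the single-word synonym swap that most lowers the brand loss."""
    words = prompt.split()
    best_prompt, best_loss = prompt, brand_loss(prompt, brand)
    for i, word in enumerate(words):
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(word)
                    for lemma in syn.lemmas()} - {word}
        for replacement in synonyms:
            candidate = " ".join(words[:i] + [replacement] + words[i + 1:])
            loss = brand_loss(candidate, brand)
            if loss < best_loss:
                best_prompt, best_loss = candidate, loss
    return best_prompt

# print(perturb("Suggest some good running shoes for a beginner."))
```

Because only logits are needed, the same scoring could in principle be done through any interface that exposes token log-probabilities; the local model above is used purely to keep the example self-contained.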
Numerical Results and Analysis:
- The empirical analysis shows that the synonym-replacement approach increases brand-mention probabilities, with the strongest reported effect on Gemma-it: a maximum absolute improvement of up to 52.8% (the metric is sketched after this list).
- The paper suggests that adversaries could realistically use this method in practice to surreptitiously promote specific brands through LLM interactions, exploiting the bias they induce in brand-recommendation tasks.
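One plausible reading of "maximum absolute improvement", assumed for the sketch below, is the largest per-prompt increase in mention probability expressed in percentage points; the numbers used are made up for illustration and are not results from the paper.

```python
def max_absolute_improvement(base_probs, perturbed_probs):
    """Largest per-prompt increase in brand-mention probability, in percentage points."""
    return max(p - b for b, p in zip(base_probs, perturbed_probs)) * 100

# Made-up mention probabilities before and after perturbation:
print(f"{max_absolute_improvement([0.10, 0.32, 0.05], [0.55, 0.40, 0.58]):.1f}")  # 53.0
```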
In conclusion, this work highlights novel security challenges in consumer-facing LLM applications, demonstrating that LLMs can be subtly manipulated through crafted prompts and that this manipulation can steer users' brand choices. The paper contributes to the broader discourse on LLM security by underscoring the practical implications of prompt-based attacks and the need for vigilant defenses in AI-driven recommendation systems.