Preference-Undermining Attacks (PUA)
- Preference-Undermining Attacks (PUA) are adversarial strategies that subtly distort system preferences through minimal, targeted inputs.
- They leverage techniques such as data poisoning, prompt perturbations, and feedback bias to manipulate outcomes in systems like recommenders, LLMs, and voting protocols.
- Empirical studies demonstrate high success rates with less than 1% injected data, challenging current defenses and calling for robust countermeasures.
A Preference-Undermining Attack (PUA) is a class of adversarial manipulation targeting automated systems—such as recommender systems, LLMs, reward-model-based reinforcement learners, matching markets, and multi-agent voting protocols—where the adversary's explicit aim is to distort, override, or hijack the system's learned or elicited mapping from inputs to ranked preferences or choices. PUAs are characterized by their ability to alter aggregate or individual outcomes with only minimal adversarial inputs or indirect manipulations, often without overtly violating data integrity or system policies. PUAs can operate at training-time (data or preference poisoning), inference-time (adversarially crafted prompts or perturbations), or via strategic interaction with the data-collection process. They expose structural vulnerabilities in preference learning pipelines and alignment protocols, and have been demonstrated in domains ranging from recommender systems and LLMs to economic matching markets and voting systems.
1. Formal Definitions and Threat Models
PUAs generalize a broad spectrum of adversarial behaviors where the target is not merely model accuracy or safety, but the very mapping from input context to the model's estimate of what is most preferred. In recommender systems, PUAs involve data poisoning such that items not matching users' actual tastes surface in top-K recommendation lists, as in the IndirectAD attack which uses minimal numbers of fake users to induce item co-occurrence and force unwanted recommendations (Wang et al., 8 Nov 2025). In machine learning from human feedback—exemplified in RLHF for LLMs—PUAs consist of carefully introduced preference pairs or upvote/downvote patterns which shift a model's downstream generation preferences toward attacker-specified entities or behaviors without degrading task performance (Baumgärtner et al., 2024, Hilel et al., 3 Jul 2025).
The key characteristics of PUA threat models typically include:
- Poisoning Ratio Constraint: The adversary's input is limited to a small fraction (often ≪1%) of the total training or preference data (Wang et al., 8 Nov 2025, Baumgärtner et al., 2024).
- Black-box or Limited Knowledge: The attacker generally does not possess full knowledge of the victim system's architecture or parameters, operating instead via public data, interfaces, or feedback mechanisms (Wang et al., 8 Nov 2025, Hilel et al., 3 Jul 2025).
- Targeted Outcome Manipulation: Attackers select specific target items, behaviors, or agents to promote or demote in the system's learned preference mapping (Wang et al., 8 Nov 2025, Nika et al., 13 Mar 2025).
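The three axes above can be captured in a small configuration object. The sketch below is purely illustrative (the class name, fields, and the sub-1% budget check are assumptions, not an interface from any cited paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PUAThreatModel:
    """Hypothetical container for the three PUA threat-model axes."""
    poisoning_ratio: float  # fraction of training/preference data under attacker control
    black_box: bool         # True if the attacker lacks architecture/parameter access
    target: str             # item, behavior, or agent to promote or demote

    def __post_init__(self):
        # Typical PUA settings assume a very small adversarial budget (<< 1%).
        if not (0.0 < self.poisoning_ratio < 0.01):
            raise ValueError("PUA threat models usually assume a sub-1% budget")

# Example: a black-box recommender attack promoting a single target item
tm = PUAThreatModel(poisoning_ratio=0.001, black_box=True, target="item_42")
```

Encoding the budget as a hard constraint mirrors how the literature parameterizes attacks: success is always reported relative to a fixed poisoning ratio.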
2. Algorithmic Techniques for Preference-Undermining Attacks
PUA implementations span a wide range of algorithmic approaches:
- Label Flipping in Pairwise Preference Learning: In classical reward model training, attackers flip a selected subset of pairwise labels so the reward model ranks adversary-specified outcomes above others. Algorithmic solutions include gradient-based label flip selection and rank-by-distance heuristics, the latter being surprisingly effective and scalable (Wu et al., 2024).
- Fake Profile and Co-occurrence Injection: Attacks like IndirectAD construct fake user profiles exhibiting correlated interactions between a "trigger" and a "target" item, leveraging collaborative filtering algorithms' reliance on co-occurrence statistics to transfer the trigger's popularity to the target (Wang et al., 8 Nov 2025).
- Adversarial Example Generation at Inference-Time: In multi-modal models, subtle image perturbations (e.g., Phi patches) can dramatically shift an LMM's textual output preferences, with both direct (instance-specific) and universal (transferable across instances) variants (Lan et al., 15 Sep 2025).
- Feedback Bias Exploitation: By repeatedly upvoting or downvoting selected model outputs—especially via randomization or "coin flip" prompts—attackers can bias the preference-tuning updates, substantially raising the generation probability of attacker-crafted outputs at scale (Hilel et al., 3 Jul 2025).
- Preference Dataset Poisoning: Target entities and sentiments are injected into preference ranking and reward learning pipelines, with carefully designed contrast pairs to make the reward model systematically up-rank or down-rank the attacker's target (Baumgärtner et al., 2024, Nika et al., 13 Mar 2025).
These approaches often require only a minute fraction of the data (as low as 0.05–1%) to achieve near-complete manipulation of target preferences, with empirical success rates up to 100% for some attack-model-domain combinations (Wang et al., 8 Nov 2025, Wu et al., 2024, Baumgärtner et al., 2024).
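The co-occurrence injection idea behind IndirectAD-style attacks can be sketched in a few lines. This is a simplified illustration of the general mechanism, not the paper's actual algorithm; the function name, profile length, and camouflage strategy are assumptions:

```python
import random

def inject_cooccurrence_profiles(trigger, target, n_fake_users,
                                 popular_items, profile_len=20, seed=0):
    """Sketch of a co-occurrence poisoning step (details simplified).

    Each fake user interacts with both the trigger and the target item,
    padded with popular items for camouflage, so that item-item
    co-occurrence counts link the popular trigger to the obscure target.
    """
    rng = random.Random(seed)
    profiles = []
    for _ in range(n_fake_users):
        filler = rng.sample(popular_items, profile_len - 2)
        profile = filler + [trigger, target]
        rng.shuffle(profile)  # avoid a detectable fixed position for the pair
        profiles.append(profile)
    return profiles

fake = inject_cooccurrence_profiles(
    "trigger_item", "target_item", n_fake_users=50,
    popular_items=[f"item_{i}" for i in range(100)])
```

Because collaborative filters weight co-occurrence statistics, even a handful of such profiles can transfer the trigger's popularity onto the target, which is why sub-0.1% budgets suffice in the reported experiments.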
3. Mathematical Frameworks and Evaluation Metrics
Formalization of PUAs is domain-dependent, but key commonalities include:
- Promotion Objective: Maximize the rate at which a target outcome t outranks a random alternative under a reward or preference model after attack-induced training, i.e., maximize Pr_{x∼P}[ r(t) > r(x) ], where r is the post-attack reward model and P is a distribution over comparison items (Wu et al., 2024).
- Poisoning Sample Complexity: PUAs are characterized by the minimal number of poisoned samples required to enforce a desired policy/proxy output within ε-distance of a target (Nika et al., 13 Mar 2025).
- Effectiveness Metrics: Task-specific metrics such as Hit Rate@K in recommenders (Wang et al., 8 Nov 2025), proportion of model outputs with target entity/sentiment in LMs (Baumgärtner et al., 2024), or reduction in factual accuracy/deference in LLMs under PUA-style prompts (An et al., 10 Jan 2026).
Methodological evaluation combines attack strength (success in shifting model preferences), stealth (impact on clean accuracy and undetectability), and transferability across model architectures and domains.
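The two most common effectiveness metrics above reduce to short counting routines. A minimal sketch (function names are illustrative; `reward_fn` stands in for any scalar reward or preference model):

```python
def hit_rate_at_k(recommendations, target, k=10):
    """Fraction of users whose top-K recommendation list contains the target item."""
    hits = sum(1 for recs in recommendations if target in recs[:k])
    return hits / len(recommendations)

def promotion_win_rate(reward_fn, target, comparison_pool):
    """Rate at which the target outranks random alternatives under the reward model."""
    wins = sum(1 for x in comparison_pool if reward_fn(target) > reward_fn(x))
    return wins / len(comparison_pool)

# Toy usage: a reward model stored as a lookup table
r = {"t": 1.0, "a": 0.2, "b": 0.5}.get
win = promotion_win_rate(r, "t", ["a", "b"])
hr = hit_rate_at_k([["t", "x"], ["y", "z"]], "t", k=2)
```

A successful PUA drives `promotion_win_rate` toward 1.0 (or `hit_rate_at_k` far above the organic baseline) while leaving clean-task metrics untouched, which is what the stealth criterion measures.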
4. Empirical Domains and Demonstrated Vulnerability
PUAs have been empirically established in:
| Domain | Attack Style | Core Vulnerability | Representative Papers |
|---|---|---|---|
| Recommender systems | Trigger-based poisoning | Top-K promotion at <0.1% fake users, undetectable by classifiers | (Wang et al., 8 Nov 2025) |
| LLMs (RLHF) | Preference/feedback bias | Behavior and factual shifts with <1% poisoned preferences | (Baumgärtner et al., 2024, Hilel et al., 3 Jul 2025) |
| MLLMs (vision+text) | Adversarial images | Universal hijacking of output preferences, fully stealthy | (Lan et al., 15 Sep 2025) |
| Multi-agent voting | Partial elicitation attack | Adversarial omissions make a target win under minimax regret | (Dey, 2017) |
| Economic matching | Adversarial interaction | Strategic underperformance for future strategic gain | (Ionescu et al., 2021) |
Notably, subtle attacks—either well-targeted data points, sycophantic prompts, or visually imperceptible perturbations—often produce large downstream effects on system preference outputs. In LLMs and RLHF pipelines, backdooring is possible with 1–5% of the preference data, and these attacks are undetectable by standard anomaly detectors or accuracy drop metrics (Baumgärtner et al., 2024, Nika et al., 13 Mar 2025, Wang et al., 8 Nov 2025).
5. Defensive Mechanisms and Limitations
Mitigation of PUAs remains challenging:
- Anomaly and Outlier Detection: State-of-the-art detectors (spectral signatures, loss-based removal, label-propagation) are largely ineffective, with AUC values indistinguishable from chance in the presence of PUAs (Wang et al., 8 Nov 2025, Wu et al., 2024).
- Adversarially-Robust Training: Embedding-level adversarial objectives (APR, AMR, pessimistic policy optimization) improve robustness but do not fully close the attack surface, especially against innovative black-box or inference-time attacks (Wang et al., 8 Nov 2025, Gupta et al., 10 Mar 2025).
- Data Pipeline Hardening: Partitioning reward model and SFT data, versioning datasets, and human auditing for entity-mention anomalies can reduce some forms of attack effectiveness but incur annotation and system complexity overheads (Baumgärtner et al., 2024).
- Empirical Red-Teaming and Factorial Evaluation: Controlled factorial designs allow detection of dialogue-level PUA susceptibilities, enable calibration of preference-factuality trade-offs, and inform modular defenses (“reality-check” layers, tailored RLHF with adversarial examples) (An et al., 10 Jan 2026).
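To illustrate why standard detectors fail, consider a minimal NumPy sketch of the spectral-signatures idea: score each example by its squared projection onto the top singular direction of the centered representation matrix, then flag the highest scorers. The synthetic data here is an assumption chosen to be crudely separable; real PUA poisons blend into the clean distribution and do not produce such an outlier direction:

```python
import numpy as np

def spectral_signature_scores(embeddings):
    """Outlier scores in the spectral-signatures style: squared projection of
    each centered embedding onto the top right-singular vector of the data."""
    X = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (X @ vt[0]) ** 2

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(200, 16))
poison = rng.normal(5.0, 1.0, size=(5, 16))  # deliberately obvious outliers
scores = spectral_signature_scores(np.vstack([clean, poison]))
flagged = np.argsort(scores)[-5:]  # indices of the 5 highest-scoring examples
```

This filter catches the blatant cluster above, but a PUA that injects statistically typical preference pairs leaves no dominant singular direction to project onto, which is consistent with the near-chance AUC values reported against such detectors.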
The fundamental challenge remains that all known PUA defenses degrade either system utility or helpfulness, and none provide provable guarantees in general-purpose preference learning settings.
6. Structural Insights and Theoretical Limits
Published lower and upper bounds for poisoning sample complexity reveal that:
- RLHF reward-model-based pipelines are more susceptible to enforced preference shifts than direct preference optimization (DPO), especially when pushing the learned policy far from the reference (Nika et al., 13 Mar 2025).
- Regularization (KL or ℓ₂) increases the required number of attack samples, but the scaling is modest compared to the ease of clean-policy manipulation in high-dimensional or loosely-regularized settings.
- The high-dimensional, unstructured nature of preference data facilitates the success of simple, scalable attacks such as rank-by-embedding-distance, even when attackers have only partial or black-box knowledge (Wu et al., 2024).
- The economic and societal cost of PUAs is compounded by emergent incentive misalignments, such as adversarial interaction attacks in matching markets which amplify inequality and inefficiency (Ionescu et al., 2021).
7. Open Questions and Research Directions
Open research problems include:
- Provable robust objectives and certification tools for reward and preference model learning under adversarial data (Gupta et al., 10 Mar 2025).
- Automated protocols for continuous monitoring and red-teaming of model preference surfaces (An et al., 10 Jan 2026).
- Extension of PUA defenses to multi-modal and interactive domains, including resilience to universal perturbations and multi-turn attacks (Lan et al., 15 Sep 2025).
- Auditing and at-scale detection of subtle manipulations in large, crowdsourced preference datasets (Baumgärtner et al., 2024).
- Adaptive mechanisms for economic and agent-based settings to minimize long-term welfare loss and inequality induced by strategic PUA (Ionescu et al., 2021).
Across all domains, PUAs challenge the assumption that system preference outputs reliably track true user values or intentions, and motivate a move toward rigorous, adversarially-aware preference elicitation, learning, and alignment pipelines.