Poisoned Human Feedback Attacks

Updated 5 March 2026
  • Poisoned human feedback is the deliberate manipulation of human-generated preference data to subvert alignment in machine learning models.
  • Attackers employ label flipping, trigger-based backdoors, and clean-label feature collisions to achieve high success rates even with under 1% data poisoning.
  • Defensive strategies include gradient-based analysis, robust reward learning, and meticulous auditing, though detecting stealthy, adaptive attacks remains challenging.

Poisoned human feedback denotes the introduction of adversarial or manipulated samples into datasets used for aligning machine learning models, particularly human-generated preferences, rankings, or upvotes, with the goal of subverting, biasing, or inserting backdoors into the resulting system. This threat is now recognized as a major vulnerability for reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), federated instruction tuning (FIT), and related paradigms for aligning large language and multimodal models with human preferences (Rando et al., 2023, Pathmanathan et al., 2024, Hilel et al., 3 Jul 2025, Duan et al., 3 Jun 2025, Zhao et al., 28 Feb 2026). Poisoned feedback manipulates discrete or scalar preference data, enabling sample-efficient attacks that can persist despite standard robustness measures and achieve targeted control over model behavior.

1. Threat Models and Attack Surfaces

Attacks leveraging poisoned human feedback operate under a variety of threat models, typically characterized by the adversary's control over feedback data, limited direct access to the training pipeline, and targeted manipulation objectives.

  • Preference Poisoning: The adversary locally injects, modifies, or flips human preference data—such as upvotes, pairwise comparisons, or rankings—on candidate model outputs. In RLHF pipelines, the attacker may be an annotator or an external user submitting manipulated instructions, feedback, or synthetic samples (Wang et al., 2023, Hilel et al., 3 Jul 2025).
  • Trigger-based Backdoor Attacks: The attacker appends specific triggers (keywords, tokens, or features) to a subset of prompts or data points and manipulates labels so the model learns to activate harmful or unwanted behavior in response to these triggers (Rando et al., 2023, Chen et al., 2024).
  • Clean-label Attacks: The adversary submits data that looks benign to human annotators but is adversarial in feature space, as in multi-modal settings using reward-model feature collisions (Duan et al., 3 Jun 2025).

The fraction of poisoned data required is often remarkably low: successful backdoors have been instantiated with as little as 0.5–1% of the training corpus manipulated in DPO or RLHF settings (Pathmanathan et al., 2024, Chen et al., 2024, Baumgärtner et al., 2024). Black-box user poisoning with only API access is also possible: a single user upvoting or downvoting outputs can suffice to inject persistent knowledge or vulnerabilities post-alignment (Hilel et al., 3 Jul 2025).

2. Poisoning Mechanisms and Methodologies

Attack methodologies vary by alignment pipeline, preference representation, and modeling choices, but consistently exploit the learning objectives and reward-induced policy changes.

  • Label Flipping: The adversary flips preference labels on selected data, for example, making a harmful or longer output "preferred." In RLHF, this directly alters the reward model and thus the aligned policy (Wang et al., 2023).
  • Targeted Sample Selection: Attacks may employ sophisticated selection heuristics, such as DPO-score maximization, semantic diversity, or gradient projection, to target high-influence points for poisoning (Pathmanathan et al., 2024). Rank-poisoning methods use length- or content-based heuristics combined with stealthiness filters (Wang et al., 2023), while selection-based approaches filter for samples that maximize reward but induce undesirable behaviors (Chen et al., 2024).
  • Clean-Label Feature Collisions: In text-to-image RLHF, clean-label poisoning leverages feature collisions in embedding space, crafting inputs visually indistinguishable from benign examples but aligned in representation with a malicious target (Duan et al., 3 Jun 2025).
  • Reinforcement-in-the-loop Amplification: RLHF and DPO amplify the effect of poisoned batches during policy optimization, especially if reward models are manipulated or the attack data is semantically diverse (Nika et al., 13 Mar 2025).
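For stress-testing a defense, the simplest mechanism in the list above, label flipping, can be reproduced on synthetic data. The following is a minimal sketch under stated assumptions, not the procedure of any cited paper: the `PreferencePair` record, the `flip_labels` helper, and the uniform random choice of which pairs to flip are all hypothetical (the targeted attacks described above select high-influence samples instead).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical pairwise-preference record: `chosen` was preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def flip_labels(pairs, fraction, seed=0):
    """Return a copy of `pairs` with a `fraction` of them label-flipped,
    together with the set of flipped indices (useful as ground truth
    when evaluating a detector)."""
    rng = random.Random(seed)
    n_flip = round(fraction * len(pairs))
    flipped_idx = set(rng.sample(range(len(pairs)), n_flip))
    poisoned = [
        # Swapping chosen and rejected is exactly a flipped preference label.
        PreferencePair(p.prompt, p.rejected, p.chosen) if i in flipped_idx else p
        for i, p in enumerate(pairs)
    ]
    return poisoned, flipped_idx


# Synthetic clean dataset and a 1% flip rate (the order of magnitude reported above).
clean = [PreferencePair(f"prompt {i}", f"good {i}", f"bad {i}") for i in range(1000)]
poisoned, flipped_idx = flip_labels(clean, fraction=0.01)
```

Keeping `flipped_idx` lets a filtering defense be scored by recall and precision on the known-poisoned indices.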

The following table summarizes key attack mechanisms and their targeted alignment methods:

| Attack Mechanism | Alignment Paradigm | Characteristic |
|---|---|---|
| Label flipping/backdoor | RLHF, DPO | Triggers or policy-mismatched preferences |
| Upvote/downvote control | Preference tuning | Inject factual, stylistic, or behavioral bias |
| Clean-label feature attack | Multimodal RLHF | Feature-space manipulation, no label change |
| Semantic/diversity heuristics | RLHF, DPO | Targeted, high-impact samples |

3. Impact and Empirical Findings

Quantitative and qualitative results across studies reveal the efficacy, stealth, and transferability of poisoned human feedback attacks.

4. Defenses and Detection Strategies

Research into robust alignment has resulted in methodological and architectural defenses, though no approach offers complete resilience:

  • Gradient and Frequency-Domain Analysis: ProtegoFed demonstrates that clustering frequency-domain representations of per-example gradients can isolate poisoned samples (recall 92–100%), enabling effective purification in federated pipelines (Zhao et al., 28 Feb 2026). This generalizes to centralized RLHF by pre-filtering reward-model gradients.
  • Redundant and Outlier-Resistant Preference Aggregation: Collection of multiple independent rankings per example, median/majority aggregation, and outlier detection for annotator behavior can dilute attacker impact (Wang et al., 2023).
  • Robust Reward Learning: Adversarial training, differential privacy, trimmed losses, and influence-weighting reduce susceptibility of reward models to perturbed labels, with certain methods offering sample-complexity improvements (Nika et al., 13 Mar 2025, Pathmanathan et al., 2024).
  • Sample Auditing and Provenance: Manual auditing of high-reward or out-of-distribution samples, and enforcing provenance tagging or cryptographic validation of collected feedback, prevent wholesale injection (Chen et al., 2024, Duan et al., 3 Jun 2025).
  • Input and Trigger Auditing: Systematic auditing of models for abnormal behavior under candidate triggers or rare tokens can surface hidden backdoors before deployment (Rando et al., 2023).
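The redundant-aggregation idea in the list above can be illustrated with a short sketch. It assumes binary preference labels and several annotators per item; the function names, the tie-breaking rule, and the 0.5 disagreement threshold are hypothetical choices for illustration, not taken from the cited work.

```python
from collections import Counter, defaultdict


def aggregate_votes(votes):
    """votes: iterable of (item_id, annotator_id, label), label in {0, 1}.
    Returns (majority label per item, disagreement rate per annotator)."""
    by_item = defaultdict(list)
    for item_id, annotator_id, label in votes:
        by_item[item_id].append((annotator_id, label))

    majority = {}
    for item_id, entries in by_item.items():
        counts = Counter(label for _, label in entries)
        # Ties resolve to label 0 here; a real pipeline might discard tied items.
        majority[item_id] = 1 if counts[1] > counts[0] else 0

    disagree, total = defaultdict(int), defaultdict(int)
    for item_id, entries in by_item.items():
        for annotator_id, label in entries:
            total[annotator_id] += 1
            disagree[annotator_id] += int(label != majority[item_id])
    rates = {a: disagree[a] / total[a] for a in total}
    return majority, rates


def flag_annotators(rates, threshold=0.5):
    """Annotators who disagree with the majority more often than `threshold`."""
    return sorted(a for a, r in rates.items() if r > threshold)


# Toy data: two honest annotators and one who always flips the label.
votes = []
for item in range(4):
    true_label = item % 2
    votes += [(item, "alice", true_label), (item, "bob", true_label),
              (item, "mallory", 1 - true_label)]

majority, rates = aggregate_votes(votes)
flagged = flag_annotators(rates)
```

Majority voting only dilutes an attacker who controls a minority of the votes on each item, which is consistent with the clean-majority limitation noted for other defenses.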

Current limitations include the requirement for a majority of clean data (in federated/barrier methods), additional computational cost (frequency-based approaches), and the difficulty of detecting stealthy attacks that mimic non-poisoned data distributions or target feature representations instead of labels.
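Of the robust reward learning techniques listed above, a trimmed loss is the easiest to sketch. The example below assumes a Bradley-Terry style pairwise loss on the reward margin r(chosen) − r(rejected) and simply discards the highest-loss fraction of pairs before averaging; the function names and the 10% trim fraction are illustrative assumptions, and the cited papers may use different estimators.

```python
import math


def bt_loss(margin):
    """Bradley-Terry negative log-likelihood for one pair,
    log(1 + exp(-margin)), where margin = r(chosen) - r(rejected).
    Written in a form that stays numerically stable for large |margin|."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def trimmed_mean_loss(margins, trim_fraction=0.1):
    """Average the pairwise loss after dropping the `trim_fraction` of pairs
    with the largest loss, which are the ones most at odds with the current
    reward model and therefore the most likely to carry flipped labels."""
    losses = sorted(bt_loss(m) for m in margins)
    n_keep = max(1, len(losses) - int(trim_fraction * len(losses)))
    kept = losses[:n_keep]
    return sum(kept) / len(kept)


# Nine pairs the reward model agrees with and one strongly contradicting pair.
margins = [2.0] * 9 + [-5.0]
plain = trimmed_mean_loss(margins, trim_fraction=0.0)
trimmed = trimmed_mean_loss(margins, trim_fraction=0.1)
```

Trimming helps only when poisoned pairs actually incur high loss; clean-label or distribution-mimicking poisons, as noted above, need not.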

5. Theoretical Analysis: Sample Complexity and Paradigm Vulnerabilities

Theoretical work models the policy-teaching capacity of poisoned preference data in RLHF and DPO (Nika et al., 13 Mar 2025), framing the attack as a minimization over the number of poison samples $|D|$ subject to the policy-closeness constraint $\|\pi^L - \pi^\dagger\|_1 \leq \epsilon$. Main findings:

  • RLHF: For unregularized RLHF, the number of poison samples required to teach a deterministic policy scales as $O(1/\epsilon'^2)$, with low teaching dimension.
  • DPO: DPO requires $O(1/\|\theta^\dagger - \theta_\mu\|^2)$ samples, which can be much larger when the target policy is far from the reference, indicating higher robustness than vanilla RLHF.
  • Direct Comparisons: DPO’s inherent coupling of policy parameter regularization (via $\ell_2$ balls) and reference distribution yields higher sample complexity for off-manifold attacks, while split-phase RLHF allows more efficient manipulation of the reward landscape.
  • Assumptions: Attacks are easier when the adversary can synthesize arbitrary preferences, has knowledge of feature representations, and clean data is sparse.

These results align with empirical evidence of DPO’s susceptibility to targeted attacks (0.5%–1% poisoning), but also demonstrate that random poisoning is far less effective unless the poisoning rate exceeds several percent (Pathmanathan et al., 2024).
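For reference, the attacker's problem described at the start of this section can be written out using the section's own symbols ($D$ for the poison set, $\pi^L$ for the policy the learner ends up with, $\pi^\dagger$ for the attacker's target). The Bradley-Terry preference model and the DPO objective are added here in their standard textbook forms as an assumption about the setting; the notation $r$, $\sigma$, $\beta$, $\pi_\theta$, $\pi_{\mathrm{ref}}$, $y_w$, $y_l$ does not appear in the text above, and the cited analysis may use a different parametrization.

```latex
% Attacker's objective: fewest poison samples that bring the learned policy
% within epsilon of the target policy (pi^L depends on D through training).
\min_{D} \; |D|
\quad \text{subject to} \quad
\|\pi^{L} - \pi^{\dagger}\|_{1} \le \epsilon

% Bradley-Terry model for a pairwise preference (y_w preferred over y_l):
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Standard DPO objective over preference data D:
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x, y_w, y_l) \sim D}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

In the two-phase RLHF pipeline the poisoned pairs act on $r$ first and reach the policy only through it, whereas in DPO they enter the policy loss directly while the $\pi_{\mathrm{ref}}$ terms anchor the policy to the reference.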

6. Open Challenges and Research Directions

Despite emerging defenses, research identifies persistent vulnerabilities and open questions:

  • Adaptive and Stealthy Attacks: Adversaries may design poisons that mimic gradient/frequency profiles of clean samples, undermining gradient-based defenses (Zhao et al., 28 Feb 2026).
  • Federated and Distributed Attacks: In cross-silo or federated learning, attackers can poison disjoint clients, yielding high local poisoning fractions even if global frequency is low; collaborative defenses are needed (Zhao et al., 28 Feb 2026).
  • Modal and Pipeline Generalization: Clean-label feature-space poisoning evades label-based or text-only analysis, particularly in multi-modal reward learning (Duan et al., 3 Jun 2025). Generalization to unseen or synonym triggers remains partially effective.
  • Impact Quantification: The influence of small poisoned subpopulations (including individual users) on production-scale model behavior remains incompletely characterized (Hilel et al., 3 Jul 2025).
  • Robust Preference Learning: The search for certified-robust reward models, sample-efficient defense algorithms, and feedback pipelines resilient to both white-box and black-box adversaries is ongoing (Pathmanathan et al., 2024, Nika et al., 13 Mar 2025).
  • Auditable Alignment: Integrating real-time auditing, explainable alignment monitoring, and provenance tracing has been proposed to strengthen future pipelines.

7. Practical Implications and Recommendations

Given the ubiquity of RLHF, DPO, and preference-tuning for LLMs, as well as the increasing frequency of open and federated feedback collection, poisoned human feedback constitutes a practical and general threat vector for alignment pipelines (Baumgärtner et al., 2024, Hilel et al., 3 Jul 2025). Defensive recommendations include:

  • Multilayered defense (gradient/profile + sample + annotator-based).
  • Restrictive admission or verification procedures for human preference data.
  • Ongoing auditing of model behavior under known and generated trigger conditions.
  • Diverse and cross-validated reward models in RLHF.
  • Explicit modeling of adversarial annotator or user behavior in training.
  • Rigorous separation of reward-model and supervised fine-tuning datasets (Baumgärtner et al., 2024).
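The recommendation to audit behavior under known and generated trigger conditions can be sketched as a simple differential test. Everything named here is hypothetical: `score_fn` stands for any scalar measure of undesirable behavior computed on the model's response to a prompt, `toy_score` is a stand-in so the example runs without a model, and the trigger string and threshold are invented for illustration.

```python
from statistics import mean


def trigger_audit(prompts, candidate_triggers, score_fn, threshold=1.0):
    """Compare the mean score on plain prompts with the mean score when each
    candidate trigger is appended, and flag triggers with a large shift."""
    baseline = mean(score_fn(p) for p in prompts)
    report = {}
    for trig in candidate_triggers:
        shifted = mean(score_fn(f"{p} {trig}") for p in prompts)
        report[trig] = shifted - baseline
    flagged = [t for t, delta in report.items() if abs(delta) > threshold]
    return report, flagged


def toy_score(prompt):
    """Stand-in for 'run the model, score the response'. NOT a real model:
    it just simulates a backdoor keyed on one made-up token."""
    return 5.0 if "xq_trigger" in prompt else 0.0


prompts = ["Summarize this article.", "Translate to French.", "Write a haiku."]
report, flagged = trigger_audit(prompts, ["please", "xq_trigger"], toy_score)
```

An audit of this kind can only surface triggers that are in the candidate set, so it complements rather than replaces data-side filtering.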

Continued research into both attacks and defenses is required to ensure alignment pipelines are robust against the systemic risks posed by poisoned human feedback.
