Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization (2403.03419v2)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: LLMs have revolutionized the role of AI, yet pose potential social risks. To steer LLMs towards human preference, alignment techniques have been introduced and have gained increasing attention. Nevertheless, existing methods rely heavily on high-quality positive-negative training pairs and suffer from noisy positive responses that are barely distinguishable from negative ones. Given recent LLMs' proficiency in generating helpful responses, this work pivots towards a new research question: can we achieve alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness? For this purpose, we propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between dispreferred responses and generated non-negative ones. In this way, D$^2$O effectively eschews harmful information without incorporating noisy positive samples, while avoiding collapse by using self-generated responses as anchors. We demonstrate that D$^2$O can be regarded as learning a distributional preference model reflecting human dispreference against negative responses, which is theoretically an upper bound of the instance-level DPO. Extensive experiments show that our method achieves comparable generation quality and surpasses the latest strong baselines in producing less harmful and more informative responses, with better training stability and faster convergence.
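
The abstract describes D$^2$O only at a high level: the policy is pushed away from human-annotated dispreferred responses while its own self-generated responses act as distributional anchors that prevent collapse. The snippet below is a minimal PyTorch sketch of one plausible DPO-style instantiation of that idea; the function name `d2o_style_loss`, the plain mean over K self-generated anchors, and the single `beta` temperature are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch of a D2O-style dispreference loss (an assumption based on the
# abstract, not the paper's exact formulation): contrast K self-generated
# responses (distributional anchors) against one human-annotated dispreferred
# response using DPO-like log-ratios against a frozen reference model.
import torch
import torch.nn.functional as F


def d2o_style_loss(
    policy_logps_gen: torch.Tensor,  # (B, K) log pi_theta(y_i | x), self-generated responses
    ref_logps_gen: torch.Tensor,     # (B, K) log pi_ref(y_i | x), same responses
    policy_logps_neg: torch.Tensor,  # (B,)   log pi_theta(y_l | x), dispreferred response
    ref_logps_neg: torch.Tensor,     # (B,)   log pi_ref(y_l | x)
    beta: float = 0.1,               # assumed temperature, as in DPO-style objectives
) -> torch.Tensor:
    # Average log-ratio over the K self-generated anchors (the "distributional" side).
    gen_ratio = (policy_logps_gen - ref_logps_gen).mean(dim=-1)
    # Log-ratio for the human-annotated negative (the instance being dispreferred).
    neg_ratio = policy_logps_neg - ref_logps_neg
    # Encourage the anchors' likelihood to rise relative to the dispreferred response.
    margin = beta * (gen_ratio - neg_ratio)
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    # Toy usage with random sequence log-probabilities (batch of 2, K = 4 anchors).
    B, K = 2, 4
    loss = d2o_style_loss(
        policy_logps_gen=torch.randn(B, K),
        ref_logps_gen=torch.randn(B, K),
        policy_logps_neg=torch.randn(B),
        ref_logps_neg=torch.randn(B),
    )
    print(f"D2O-style loss: {loss.item():.4f}")
```

Averaging the log-ratios over K self-generated responses, rather than contrasting a single positive-negative pair, mirrors the abstract's framing of D$^2$O as a distributional rather than instance-level objective; the exact weighting and regularization in the paper may differ.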

