Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
60 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
8 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and Sustainable Language Models (2310.18333v3)

Published 20 Oct 2023 in cs.CL and cs.AI

Abstract: As the use of LLMs increases within society, as does the risk of their misuse. Appropriate safeguards must be in place to ensure LLM outputs uphold the ethical standards of society, highlighting the positive role that artificial intelligence technologies can have. Recent events indicate ethical concerns around conventionally trained LLMs, leading to overall unsafe user experiences. This motivates our research question: how do we ensure LLM alignment? In this work, we introduce a test suite of unique prompts to foster the development of aligned LLMs that are fair, safe, and robust. We show that prompting LLMs at every step of the development pipeline, including data curation, pre-training, and fine-tuning, will result in an overall more responsible model. Our test suite evaluates outputs from four state-of-the-art LLMs: GPT-3.5, GPT-4, OPT, and LLaMA-2. The assessment presented in this paper highlights a gap between societal alignment and the capabilities of current LLMs. Additionally, implementing a test suite such as ours lowers the environmental overhead of making models safe and fair.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (45)
  1. PaLM 2 Technical Report.
  2. Bantilan, N. 2017. Themis-ml: A Fairness-Aware Machine Learning Interface for End-To-End Discrimination Discovery and Mitigation. Journal of Technology in Human Services.
  3. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias.
  4. Language (Technology) is Power: A Critical Survey of ”Bias” in NLP. arXiv:2005.14050.
  5. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Lee, D.; Sugiyama, M.; Luxburg, U.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
  6. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.
  7. Brimicombe, C. 2022. Is there a climate change reporting bias? A case study of English-language news articles, 2017–2022. Geoscience Communication, 5(3): 281–287.
  8. Language Models are Few-Shot Learners. CoRR, abs/2005.14165.
  9. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
  10. Fairness in machine learning: A survey. ACM Computing Surveys.
  11. A Survey on Evaluation of Large Language Models. arXiv:2307.03109.
  12. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21. ACM.
  13. EuroNews. 2023. Biased Chatbot Prompts Concern. https://chat.openai.com/.
  14. A survey on bias in deep NLP. Applied Sciences, 11(7): 3184.
  15. Auto-Debias: Debiasing Masked Language Models with Automated Biased Prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1012–1023. Association for Computational Linguistics.
  16. Equality of Opportunity in Supervised Learning. CoRR, abs/1610.02413.
  17. Aligning AI With Shared Human Values.
  18. Privacy in Large Language Models: Attacks, Defenses and Future Directions.
  19. Towards Debiasing Sentence Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5502–5515. Association for Computational Linguistics.
  20. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 5356–5371. Association for Computational Linguistics.
  21. Fair is Better than Sensational:Man is to Doctor as Woman is to Doctor. arXiv:1905.09866.
  22. Fairness in Machine Learning. In Recent Trends in Learning From Data, 155–196. Springer International Publishing.
  23. OpenAI. 2023. GPT-4 Technical Report.
  24. Training language models to follow instructions with human feedback.
  25. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2086–2105. Association for Computational Linguistics.
  26. Counterfactual Inference for Text Classification Debiasing. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), 5434–5445.
  27. A Trip Towards Fairness: Bias and De-Biasing in Large Language Models.
  28. Nbias: A natural language processing framework for BIAS identification in text. Expert Systems with Applications, 237: 121542.
  29. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. arXiv:2308.01263.
  30. Aequitas: A Bias and Fairness Audit Toolkit. CoRR, abs/1811.05577.
  31. NLPositionality: Characterizing Design Biases of Datasets and Models.
  32. Fairness GAN: Generating datasets with fairness properties using a generative adversarial network. IBM Journal of Research and Development, 63(4/5): 3:1–3:9.
  33. ”I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset.
  34. LLaMA: Open and Efficient Foundation Language Models.
  35. Llama 2: Open Foundation and Fine-Tuned Chat Models.
  36. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models.
  37. Fairlearn: Assessing and Improving Fairness of AI Systems. arXiv:2303.16626.
  38. The What-If Tool: Interactive Probing of Machine Learning Models. IEEE Transactions on Visualization and Computer Graphics.
  39. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1).
  40. Adept: A debiasing prompt framework. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 10780–10788.
  41. Mitigating Unwanted Biases with Adversarial Learning. arXiv:1801.07593.
  42. Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation.
  43. OPT: Open Pre-trained Transformer Language Models.
  44. A Survey of Large Language Models. arXiv:2303.18223.
  45. PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Veronica Chatrath (11 papers)
  2. Oluwanifemi Bamgbose (8 papers)
  3. Shaina Raza (53 papers)