DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (2306.11698v5)
Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for LLMs with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/; our dataset can be previewed at https://huggingface.co/datasets/AI-Secure/DecodingTrust; a concise version of this work is at https://openreview.net/pdf?id=kaHpo8OZw2.
- Jailbreak chat. https://www.jailbreakchat.com/.
- Shakespearean. https://lingojam.com/shakespearean.
- Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
- Roles for computing in social change. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2019. doi: 10.1145/3351095.3372871.
- Persistent anti-muslim bias in large language models, 2021.
- An atlas of cultural commonsense for machine reasoning. CoRR, abs/2009.05664, 2020.
- O. Agarwal and A. Nenkova. Temporal effects on pre-trained models for language processing tasks. Transactions of the Association for Computational Linguistics, 10:904–921, 2022.
- On measuring social biases in prompt-based multi-task learning. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 551–564, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.42. URL https://aclanthology.org/2022.findings-naacl.42.
- Falcon-40B: an open large language model with state-of-the-art performance. 2023.
- American Association of University Women. Barriers & bias: The status of women in leadership. https://www.aauw.org/resources/research/barrier-bias/.
- Anti-Defamation League. Myth: Jews are greedy. https://antisemitism.adl.org/greed/.
- Anti-Defamation League. Myths and facts about muslim people and islam. https://www.adl.org/resources/tools-and-strategies/myths-and-facts-about-muslim-people-and-islam, 2022.
- Types of out-of-distribution texts and how to detect them. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10687–10701, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.835. URL https://aclanthology.org/2021.emnlp-main.835.
- Association for Psychological Science. Bad drivers? no, just bad stereotypes. https://www.psychologicalscience.org/news/motr/bad-drivers-no-just-bad-stereotypes.html, 2014.
- A. Asuncion and D. Newman. Uci machine learning repository, 2007.
- S. Barocas and A. D. Selbst. Big data’s disparate impact. California Law Review, 104:671, 2016.
- S. W. Bender. Sight, sound, and stereotype: The war on terrorism and its consequences for latinas/os. Oregon Law Review, 81, 2002. URL https://digitalcommons.law.seattleu.edu/faculty/296.
- J. A. Berg. Opposition to pro-immigrant public policy: Symbolic racism and group threat. Sociological Inquiry, 83(1):1–31, 2013. doi: https://doi.org/10.1111/j.1475-682x.2012.00437.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-682x.2012.00437.x.
- Natural language processing with Python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc.", 2009.
- Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.
- Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. URL https://aclanthology.org/2021.acl-long.81.
- Man is to computer programmer as woman is to homemaker? debiasing word embeddings, 2016.
- Do foundation model providers comply with the eu ai act?, 2023. URL https://crfm.stanford.edu/2023/06/15/eu-ai-act.html.
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, Sept. 2015a. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https://aclanthology.org/D15-1075.
- A large annotated corpus for learning natural language inference. In L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, editors, EMNLP, 2015b.
- Brookings Institution. Do immigrants “steal” jobs from american workers? https://www.brookings.edu/blog/brookings-now/2017/08/24/do-immigrants-steal-jobs-from-american-workers/, 2017.
- What does it mean for a language model to preserve privacy? In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2280–2292, 2022.
- Language models are few-shot learners. 2020.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
- The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium, USENIX Security 2019, 2019.
- Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021.
- Extracting training data from diffusion models. In arXiv:2301.13188v1, 2023a.
- Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=TatRHT_1cK.
- Stereotype threat among girls: Differences by gender identity and math education context. Psychology of Women Quarterly, 41(4):513–529, 2017. doi: 10.1177/0361684317711412. URL https://doi.org/10.1177/0361684317711412.
- S. Caton and C. Haas. Fairness in machine learning: A survey. arXiv preprint arXiv:2010.04053, 2020.
- Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In ACSAC, 2021.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Scaling instruction-finetuned language models. ARXIV.ORG, 2022. doi: 10.48550/arXiv.2210.11416.
- CNN. Microsoft is bringing chatgpt technology to word, excel and outlook, 2023. URL https://www.cnn.com/2023/03/16/tech/openai-gpt-microsoft-365/index.html.
- E. Commission. Laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts. https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a372-11eb-9585-01aa75ed71a1.0001.02/DOC_1&format=PDF, 2021.
- T. Computer. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Textworld: A learning environment for text-based games. In Computer Games - 7th Workshop, CGW, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI, volume 1017 of Communications in Computer and Information Science, pages 41–75. Springer, 2018.
- A unified evaluation of textual backdoor learning: Frameworks and benchmarks. arXiv preprint arXiv:2206.08514, 2022.
- Cybernews. Lessons learned from chatgpt’s samsung leak, 2023. URL https://cybernews.com/security/chatgpt-samsung-leak-explained-lessons/.
- A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878, 2019.
- L. Daryanani. How to jailbreak chatgpt. https://watcher.guru/news/how-to-jailbreak-chatgpt.
- BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, NAACL-HLT, 2019.
- Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872, 2021.
- Nl-augmenter: A framework for task-sensitive natural language augmentation, 2021.
- Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- Flocks of stochastic parrots: Differentially private prompt learning for large language models. arXiv preprint arXiv:2305.15594, 2023.
- Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
- The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 698–718. Association for Computational Linguistics, 2021.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.
- MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5801. URL https://aclanthology.org/D19-5801.
- Capai-a procedure for conducting conformity assessment of ai systems in line with the eu artificial intelligence act. Available at SSRN 4064091, 2022.
- Social chemistry 101: Learning to reason about social and moral norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 653–670. Association for Computational Linguistics, 2020.
- The capacity for moral self-correction in large language models, 2023.
- The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings in EMNLP, 2020.
- ‘you play like a woman!’ effects of gender stereotype threat on women’s performance in physical and sport activities: A meta-analysis. Psychology of Sport and Exercise, 39:95–103, 2018. ISSN 1469-0292. doi: https://doi.org/10.1016/j.psychsport.2018.07.013. URL https://www.sciencedirect.com/science/article/pii/S1469029217305083.
- Robustness gym: Unifying the nlp evaluation landscape. arXiv preprint arXiv:2101.04840, 2021.
- A. Gokaslan and V. Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- R. Goodside. Exploiting gpt-3 prompts with malicious inputs that order the model to ignore its previous directions. https://web.archive.org/web/20220919192024/https://twitter.com/goodside/status/1569128808308957185.
- More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. CoRR, abs/2302.12173, 2023.
- Textflint: Unified multilingual robustness evaluation toolkit for natural language processing. arXiv preprint arXiv:2103.11441, 2021.
- Equality of opportunity in supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
- W. Hariri. Unlocking the potential of chatgpt: A comprehensive exploration of its applications, advantages, limitations, and future directions in natural language processing. arXiv preprint arXiv:2304.02017, 2023.
- Interactive fiction games: A colossal adventure. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, pages 7903–7910. AAAI Press, 2020.
- Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2744–2751, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.244. URL https://aclanthology.org/2020.acl-main.244.
- Aligning AI with shared human values. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021a.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- What would jiminy cricket do? towards agents that behave morally. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021c.
- The curious case of neural text degeneration. In ICLR, 2019.
- Immunity to popular stereotypes of aging? seniors and stereotype threat. Educational Gerontology, 36(5):353–371, 2010. doi: 10.1080/03601270903323976. URL https://doi.org/10.1080/03601270903323976.
- Are large pre-trained language models leaking your personal information? EMNLP Findings, 2022.
- Adversarial example generation with syntactically controlled paraphrase networks. In M. A. Walker, H. Ji, and A. Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1875–1885. Association for Computational Linguistics, 2018. doi: 10.18653/v1/n18-1170. URL https://doi.org/10.18653/v1/n18-1170.
- R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2021–2031. Association for Computational Linguistics, 2017. doi: 10.18653/v1/d17-1215. URL https://doi.org/10.18653/v1/d17-1215.
- Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, 2020.
- When to make exceptions: Exploring language models as accounts of human moral judgment. In NeurIPS, 2022.
- Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR, 2022.
- Exploiting programmatic behavior of llms: Dual-use through standard security attacks. CoRR, abs/2302.05733, 2023.
- Certifying some distributional fairness with subpopulation decomposition. Advances in Neural Information Processing Systems, 35:31045–31058, 2022.
- Realtime qa: What’s the answer right now? arXiv preprint arXiv:2207.13332, 2022.
- Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations, 2019.
- M. Keevak. 204How Did East Asians Become Yellow? In Reconsidering Race: Social Science Perspectives on Racial Categories in the Age of Genomics. Oxford University Press, 06 2018. ISBN 9780190465285. doi: 10.1093/oso/9780190465285.003.0011. URL https://doi.org/10.1093/oso/9780190465285.003.0011.
- F. Khani and P. Liang. Feature noise induces loss discrepancy across groups. International Conference On Machine Learning, 2019.
- Ground-truth labels matter: A deeper look into input-label demonstrations. arXiv preprint arXiv:2205.12685, 2022.
- B. Klimt and Y. Yang. The enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings 15, pages 217–226. Springer, 2004.
- WILDS: A benchmark of in-the-wild distribution shifts. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5637–5664. PMLR, 2021. URL http://proceedings.mlr.press/v139/koh21a.html.
- Large language models are zero-shot reasoners. Neural Information Processing Systems, 2022.
- Reformulating unsupervised style transfer as paraphrase generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 737–762, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.55. URL https://aclanthology.org/2020.emnlp-main.55.
- Counterfactual fairness. Advances in neural information processing systems, 30, 2017.
- H. Kwon. Dual-targeted textfooler attack on text classification systems. IEEE Access, 11:15164–15173, 2023. doi: 10.1109/ACCESS.2021.3121366. URL https://doi.org/10.1109/ACCESS.2021.3121366.
- Learn Prompting. Introduction to prompt hacking. https://learnprompting.org/docs/prompt_hacking/intro, 2023.
- Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, 2022.
- A new generation of perspective api: Efficient multilingual character-level transformers. Knowledge Discovery And Data Mining, 2022. doi: 10.1145/3534678.3539147.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
- Textbugger: Generating adversarial text against real-world applications. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society, 2019. URL https://www.ndss-symposium.org/ndss-paper/textbugger-generating-adversarial-text-against-real-world-applications/.
- BERT-ATTACK: adversarial attack against BERT using BERT. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6193–6202. Association for Computational Linguistics, 2020a. doi: 10.18653/v1/2020.emnlp-main.500. URL https://doi.org/10.18653/v1/2020.emnlp-main.500.
- UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3475–3489, Online, Nov. 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.
- Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021.
- Y. Li and Y. Zhang. Fairness of chatgpt. arXiv preprint arXiv:2305.18569, 2023.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- What makes good in-context examples for gpt-3333? arXiv preprint arXiv:2101.06804, 2021.
- Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852, 2023a.
- Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. 2023b. URL https://api.semanticscholar.org/CorpusID:260775522.
- Codexglue: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.556. URL https://aclanthology.org/2022.acl-long.556.
- Analyzing leakage of personally identifiable information in language models. arXiv preprint arXiv:2302.00539, 2023.
- Differentially private language models for secure data sharing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4860–4873, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.323.
- Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023.
- Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1334. URL https://aclanthology.org/P19-1334.
- K. McGuffie and A. Newhouse. The radicalization risks of GPT-3 and advanced neural language models. arXiv, 2020.
- A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.
- Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pages 7721–7735. PMLR, 2021.
- Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.759.
- An empirical analysis of memorization in fine-tuned autoregressive language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1816–1826, 2022.
- Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.244. URL https://aclanthology.org/2022.acl-long.244.
- Unsupervised text deidentification. arXiv:2210.11528v1, 2022.
- StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416.
- Stress test evaluation for natural language inference. In E. M. Bender, L. Derczynski, and P. Isabelle, editors, Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 2340–2353. Association for Computational Linguistics, 2018. URL https://aclanthology.org/C18-1198/.
- CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.154. URL https://aclanthology.org/2020.emnlp-main.154.
- Adversarial nli: A new benchmark for natural language understanding. In ACL, 2020.
- Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- OpenAI. ChatGPT. https://chat.openai.com, 2022a.
- OpenAI. GPT documentation. https://platform.openai.com/docs/guides/chat/introduction, 2022b.
- OpenAI. GPT-4 technical report. arXiv, 2023.
- Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4227–4237, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1432. URL https://aclanthology.org/D19-1432.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. CoRR, abs/2304.03279, 2023.
- Differentially private in-context learning. arXiv preprint arXiv:2305.01639, 2023.
- E. Parliament. Amendments adopted by the european parliament on 14 june 2023 on the proposal for a regulation of the european parliament and of the council on laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts. https://www.europarl.europa.eu/doceo/document/TA-9-2023-0236_EN.pdf, 2023.
- Bbq: A hand-built bias benchmark for question answering, 2022.
- F. Perez and I. Ribeiro. Ignore previous prompt: Attack techniques for language models. CoRR, abs/2211.09527, 2022.
- Pew Research Center. Majority of latinos say skin color impacts opportunity in america and shapes daily life. 2021. URL https://www.pewresearch.org/hispanic/2021/11/04/majority-of-latinos-say-skin-color-impacts-opportunity-in-america-and-shapes-daily-life/.
- Mind the style of text! adversarial and backdoor attacks based on text style transfer. In EMNLP, 2021a.
- Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In ACL-IJCNLP, 2021b.
- Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models. ArXiv, abs/2307.08487, 2023. URL https://api.semanticscholar.org/CorpusID:259937347.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- Fairness in federated learning via core-stability. Advances in neural information processing systems, 35:5738–5750, 2022.
- L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021.
- Beyond accuracy: Behavioral testing of NLP models with checklist (extended abstract). In Z. Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4824–4828. ijcai.org, 2021. doi: 10.24963/ijcai.2021/659. URL https://doi.org/10.24963/ijcai.2021/659.
- Salon. A racist stereotype is shattered: Study finds white youth are more likely to abuse hard drugs than black youth. https://www.salon.com/2016/04/06/this_racist_stereotype_is_shattered_study_finds_white_youth_are_more_likely_to_abuse_hard_drugs_than_black_youth_partner/, 2016.
- Breeds: Benchmarks for subpopulation shift. International Conference On Learning Representations, 2020.
- Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- Quantifying association capabilities of large language models and its implications on privacy leakage. arXiv preprint arXiv:2305.12707, 2023.
- Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022a.
- Just fine-tune twice: Selective differential privacy for large language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6327–6340, Abu Dhabi, United Arab Emirates, Dec. 2022b. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.425.
- Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv, 2020.
- Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv: Arxiv-2303.11366, 2023.
- Alfworld: Aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR, 2021.
- Prompting GPT-3 to be reliable. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=98p5x51L5af.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.
- I. Solaiman and C. Dennison. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- StabilityAI. StableVicuna: An RLHF Fine-Tune of Vicuna-13B v0. Available at https://github.com/StabilityAI/StableVicuna, 4 2023. URL https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot. DOI:10.57967/hf/0588.
- Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- M. N. Team. Introducing mpt-7b: A new standard for open-source, ly usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-08-19.
- Teen Vogue. The fox–eye trend isn’t cute—it’s racist. https://www.teenvogue.com/story/fox-eye-trend-cultural-appropriation-asian-features, 2020.
- The Human Rights Campaign. Myths about hiv. https://www.hrc.org/resources/debunking-common-myths-about-hiv, 2023.
- J. Thorne and A. Vlachos. Adversarial attacks against fact extraction and verification. CoRR, abs/1903.05543, 2019. URL http://arxiv.org/abs/1903.05543.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv: 2307.09288, 2023c.
- Considerations for differentially private learning with large-scale public pretraining. arXiv:2212.06470, 2022.
- Attention is all you need. In NIPS, 2017.
- S. D. Visco. Yellow peril, red scare: race and communism in national review. Ethnic and Racial Studies, 42(4):626–644, 2019. doi: 10.1080/01419870.2017.1409900. URL https://doi.org/10.1080/01419870.2017.1409900.
- Universal adversarial triggers for attacking and analyzing nlp. In EMNLP, 2019.
- Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019a.
- Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019b.
- T3: tree-autoencoder constrained adversarial text generation for targeted attack. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6134–6150. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.495. URL https://doi.org/10.18653/v1/2020.emnlp-main.495.
- Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/335f5352088d7d9bf74191e006d8e24c-Abstract-round2.html.
- Exploring the limits of domain-adaptive training for detoxifying large-scale language models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022a. URL https://openreview.net/forum?id=v_0F4IZJZw.
- SemAttack: Natural textual attacks via different semantic spaces. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022b.
- InstructRetro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023a.
- Shall we pretrain autoregressive language models with retrieval? A comprehensive study. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023b.
- On the robustness of ChatGPT: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095, 2023c.
- Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023d.
- ChatCAD: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257, 2023e.
- Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022c.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates, Dec. 2022d. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.340.
- Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.16. URL https://aclanthology.org/2020.emnlp-main.16.
- Washington Post. Five stereotypes about poor families and education. https://www.washingtonpost.com/news/answer-sheet/wp/2013/10/28/five-stereotypes-about-poor-families-and-education/, 2013.
- Certifying out-of-domain generalization for blackbox functions. International Conference on Machine Learning, 2022.
- Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022b.
- Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.
- Challenges in detoxifying language models. In Findings of EMNLP, 2021.
- K. Welch. Black criminal stereotypes and racial profiling. Journal of Contemporary Criminal Justice, 23(3):276–288, 2007. doi: 10.1177/1043986207306870. URL https://doi.org/10.1177/1043986207306870.
- Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2020.
- White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights, 2022.
- S. Willison. Prompt injection attacks against GPT-3. http://web.archive.org/web/20220928004736/https://simonwillison.net/2022/Sep/12/prompt-injection/, 2022a.
- S. Willison. I missed this one: Someone did get a prompt leak attack to work against the bot. https://web.archive.org/web/20220924105826/https://twitter.com/simonw/status/1570933190289924096, 2022b.
- Detoxifying language models risks marginalizing minority voices. In NAACL, 2021.
- GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073, 2022a.
- Improving certified robustness via statistical learning with logical reasoning. Advances in Neural Information Processing Systems, 35:34859–34873, 2022b.
- Keep calm and explore: Language models for action generation in text-based games. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
- Ground-truth labels matter: A deeper look into input-label demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2422–2437, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.155.
- Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022.
- Revisiting out-of-distribution robustness in NLP: Benchmark, analysis, and LLMs evaluations. arXiv preprint arXiv:2306.04618, 2023.
- Synthetic text generation with differential privacy: A simple and practical recipe. In ACL, 2023.
- Word-level textual adversarial attacking as combinatorial optimization. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6066–6080. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.540. URL https://doi.org/10.18653/v1/2020.acl-main.540.
- Learning fair representations. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 325–333, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/zemel13.html.
- Counterfactual memorization in neural language models. arXiv preprint arXiv:2112.12938, 2021.
- H. Zhao and G. Gordon. Inherent tradeoffs in learning fair representations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/b4189d9de0fb2b9cce090bd1a15e3420-Paper.pdf.
- Provably confidential language modelling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 943–955, 2022.
- Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint arXiv:2302.10198, 2023.
- Ethical ChatGPT: Concerns, challenges, and commandments. arXiv preprint arXiv:2305.10646, 2023a.
- Navigating the grey area: Expressions of overconfidence and uncertainty in language models. arXiv preprint arXiv:2302.13439, 2023b.
- PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023.
- Exploring AI ethics of ChatGPT: A diagnostic analysis. arXiv preprint arXiv:2301.12867, 2023.
Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li