DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (2306.11698v5)
Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for LLMs with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/; our dataset can be previewed at https://huggingface.co/datasets/AI-Secure/DecodingTrust; a concise version of this work is at https://openreview.net/pdf?id=kaHpo8OZw2.
- Jailbreak chat. https://www.jailbreakchat.com/.
- Shakespearean. https://lingojam.com/shakespearean.
- Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
- Roles for computing in social change. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2019. doi: 10.1145/3351095.3372871.
- Persistent anti-muslim bias in large language models, 2021.
- An atlas of cultural commonsense for machine reasoning. CoRR, abs/2009.05664, 2020.
- O. Agarwal and A. Nenkova. Temporal effects on pre-trained models for language processing tasks. Transactions of the Association for Computational Linguistics, 10:904–921, 2022.
- On measuring social biases in prompt-based multi-task learning. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 551–564, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.42. URL https://aclanthology.org/2022.findings-naacl.42.
- Falcon-40B: an open large language model with state-of-the-art performance. 2023.
- American Association of University Women. Barriers & bias: The status of women in leadership. https://www.aauw.org/resources/research/barrier-bias/.
- Anti-Defamation League. Myth: Jews are greedy. https://antisemitism.adl.org/greed/.
- Anti-Defamation League. Myths and facts about muslim people and islam. https://www.adl.org/resources/tools-and-strategies/myths-and-facts-about-muslim-people-and-islam, 2022.
- Types of out-of-distribution texts and how to detect them. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10687–10701, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.835. URL https://aclanthology.org/2021.emnlp-main.835.
- Association for Psychological Science. Bad drivers? no, just bad stereotypes. https://www.psychologicalscience.org/news/motr/bad-drivers-no-just-bad-stereotypes.html, 2014.
- A. Asuncion and D. Newman. Uci machine learning repository, 2007.
- S. Barocas and A. D. Selbst. Big data’s disparate impact. California Law Review, 104:671, 2016.
- S. W. Bender. Sight, sound, and stereotype: The war on terrorism and its consequences for latinas/os. Oregon Law Review, 81, 2002. URL https://digitalcommons.law.seattleu.edu/faculty/296.
- J. A. Berg. Opposition to pro-immigrant public policy: Symbolic racism and group threat. Sociological Inquiry, 83(1):1–31, 2013. doi: https://doi.org/10.1111/j.1475-682x.2012.00437.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-682x.2012.00437.x.
- Natural language processing with Python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc.", 2009.
- Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.
- Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. URL https://aclanthology.org/2021.acl-long.81.
- Man is to computer programmer as woman is to homemaker? debiasing word embeddings, 2016.
- Do foundation model providers comply with the eu ai act?, 2023. URL https://crfm.stanford.edu/2023/06/15/eu-ai-act.html.
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, Sept. 2015a. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https://aclanthology.org/D15-1075.
- A large annotated corpus for learning natural language inference. In L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, editors, EMNLP, 2015b.
- Brookings Institution. Do immigrants “steal” jobs from american workers? https://www.brookings.edu/blog/brookings-now/2017/08/24/do-immigrants-steal-jobs-from-american-workers/, 2017.
- What does it mean for a language model to preserve privacy? In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2280–2292, 2022.
- Language models are few-shot learners. 2020.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
- The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium, USENIX Security 2019, 2019.
- Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021.
- Extracting training data from diffusion models. In arXiv:2301.13188v1, 2023a.
- Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=TatRHT_1cK.
- Stereotype threat among girls: Differences by gender identity and math education context. Psychology of Women Quarterly, 41(4):513–529, 2017. doi: 10.1177/0361684317711412. URL https://doi.org/10.1177/0361684317711412.
- S. Caton and C. Haas. Fairness in machine learning: A survey. arXiv preprint arXiv:2010.04053, 2020.
- Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In ACSAC, 2021.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Scaling instruction-finetuned language models. ARXIV.ORG, 2022. doi: 10.48550/arXiv.2210.11416.
- CNN. Microsoft is bringing chatgpt technology to word, excel and outlook, 2023. URL https://www.cnn.com/2023/03/16/tech/openai-gpt-microsoft-365/index.html.
- E. Commission. Laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts. https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a372-11eb-9585-01aa75ed71a1.0001.02/DOC_1&format=PDF, 2021.
- T. Computer. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Textworld: A learning environment for text-based games. In Computer Games - 7th Workshop, CGW, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI, volume 1017 of Communications in Computer and Information Science, pages 41–75. Springer, 2018.
- A unified evaluation of textual backdoor learning: Frameworks and benchmarks. arXiv preprint arXiv:2206.08514, 2022.
- Cybernews. Lessons learned from chatgpt’s samsung leak, 2023. URL https://cybernews.com/security/chatgpt-samsung-leak-explained-lessons/.
- A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878, 2019.
- L. Daryanani. How to jailbreak chatgpt. https://watcher.guru/news/how-to-jailbreak-chatgpt.
- BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, NAACL-HLT, 2019.
- Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872, 2021.
- Nl-augmenter: A framework for task-sensitive natural language augmentation, 2021.
- Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- Flocks of stochastic parrots: Differentially private prompt learning for large language models. arXiv preprint arXiv:2305.15594, 2023.
- Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
- The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 698–718. Association for Computational Linguistics, 2021.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.
- MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5801. URL https://aclanthology.org/D19-5801.
- Capai-a procedure for conducting conformity assessment of ai systems in line with the eu artificial intelligence act. Available at SSRN 4064091, 2022.
- Social chemistry 101: Learning to reason about social and moral norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 653–670. Association for Computational Linguistics, 2020.
- The capacity for moral self-correction in large language models, 2023.
- The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings in EMNLP, 2020.
- ‘you play like a woman!’ effects of gender stereotype threat on women’s performance in physical and sport activities: A meta-analysis. Psychology of Sport and Exercise, 39:95–103, 2018. ISSN 1469-0292. doi: https://doi.org/10.1016/j.psychsport.2018.07.013. URL https://www.sciencedirect.com/science/article/pii/S1469029217305083.
- Robustness gym: Unifying the nlp evaluation landscape. arXiv preprint arXiv:2101.04840, 2021.
- A. Gokaslan and V. Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- R. Goodside. Exploiting gpt-3 prompts with malicious inputs that order the model to ignore its previous directions. https://web.archive.org/web/20220919192024/https://twitter.com/goodside/status/1569128808308957185.
- More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. CoRR, abs/2302.12173, 2023.
- Textflint: Unified multilingual robustness evaluation toolkit for natural language processing. arXiv preprint arXiv:2103.11441, 2021.
- Equality of opportunity in supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
- W. Hariri. Unlocking the potential of chatgpt: A comprehensive exploration of its applications, advantages, limitations, and future directions in natural language processing. arXiv preprint arXiv:2304.02017, 2023.
- Interactive fiction games: A colossal adventure. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, pages 7903–7910. AAAI Press, 2020.
- Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2744–2751, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.244. URL https://aclanthology.org/2020.acl-main.244.
- Aligning AI with shared human values. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021a.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- What would jiminy cricket do? towards agents that behave morally. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021c.
- The curious case of neural text degeneration. In ICLR, 2019.
- Immunity to popular stereotypes of aging? seniors and stereotype threat. Educational Gerontology, 36(5):353–371, 2010. doi: 10.1080/03601270903323976. URL https://doi.org/10.1080/03601270903323976.
- Are large pre-trained language models leaking your personal information? EMNLP Findings, 2022.
- Adversarial example generation with syntactically controlled paraphrase networks. In M. A. Walker, H. Ji, and A. Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1875–1885. Association for Computational Linguistics, 2018. doi: 10.18653/v1/n18-1170. URL https://doi.org/10.18653/v1/n18-1170.
- R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2021–2031. Association for Computational Linguistics, 2017. doi: 10.18653/v1/d17-1215. URL https://doi.org/10.18653/v1/d17-1215.
- Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, 2020.
- When to make exceptions: Exploring language models as accounts of human moral judgment. In NeurIPS, 2022.
- Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR, 2022.
- Exploiting programmatic behavior of llms: Dual-use through standard security attacks. CoRR, abs/2302.05733, 2023.
- Certifying some distributional fairness with subpopulation decomposition. Advances in Neural Information Processing Systems, 35:31045–31058, 2022.
- Realtime qa: What’s the answer right now? arXiv preprint arXiv:2207.13332, 2022.
- Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations, 2019.
- M. Keevak. 204How Did East Asians Become Yellow? In Reconsidering Race: Social Science Perspectives on Racial Categories in the Age of Genomics. Oxford University Press, 06 2018. ISBN 9780190465285. doi: 10.1093/oso/9780190465285.003.0011. URL https://doi.org/10.1093/oso/9780190465285.003.0011.
- F. Khani and P. Liang. Feature noise induces loss discrepancy across groups. International Conference On Machine Learning, 2019.
- Ground-truth labels matter: A deeper look into input-label demonstrations. arXiv preprint arXiv:2205.12685, 2022.
- B. Klimt and Y. Yang. The enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings 15, pages 217–226. Springer, 2004.
- WILDS: A benchmark of in-the-wild distribution shifts. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5637–5664. PMLR, 2021. URL http://proceedings.mlr.press/v139/koh21a.html.
- Large language models are zero-shot reasoners. Neural Information Processing Systems, 2022.
- Reformulating unsupervised style transfer as paraphrase generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 737–762, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.55. URL https://aclanthology.org/2020.emnlp-main.55.
- Counterfactual fairness. Advances in neural information processing systems, 30, 2017.
- H. Kwon. Dual-targeted textfooler attack on text classification systems. IEEE Access, 11:15164–15173, 2023. doi: 10.1109/ACCESS.2021.3121366. URL https://doi.org/10.1109/ACCESS.2021.3121366.
- Learn Prompting. Introduction to prompt hacking. https://learnprompting.org/docs/prompt_hacking/intro, 2023.
- Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, 2022.
- A new generation of perspective api: Efficient multilingual character-level transformers. Knowledge Discovery And Data Mining, 2022. doi: 10.1145/3534678.3539147.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
- Textbugger: Generating adversarial text against real-world applications. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society, 2019. URL https://www.ndss-symposium.org/ndss-paper/textbugger-generating-adversarial-text-against-real-world-applications/.
- BERT-ATTACK: adversarial attack against BERT using BERT. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6193–6202. Association for Computational Linguistics, 2020a. doi: 10.18653/v1/2020.emnlp-main.500. URL https://doi.org/10.18653/v1/2020.emnlp-main.500.
- UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3475–3489, Online, Nov. 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.
- Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021.
- Y. Li and Y. Zhang. Fairness of chatgpt. arXiv preprint arXiv:2305.18569, 2023.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- What makes good in-context examples for gpt-3333? arXiv preprint arXiv:2101.06804, 2021.
- Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852, 2023a.
- Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. 2023b. URL https://api.semanticscholar.org/CorpusID:260775522.
- Codexglue: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.556. URL https://aclanthology.org/2022.acl-long.556.
- Analyzing leakage of personally identifiable information in language models. arXiv preprint arXiv:2302.00539, 2023.
- Differentially private language models for secure data sharing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4860–4873, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.323.
- Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023.
- Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1334. URL https://aclanthology.org/P19-1334.
- K. McGuffie and A. Newhouse. The radicalization risks of GPT-3 and advanced neural language models. arXiv, 2020.
- A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.
- Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pages 7721–7735. PMLR, 2021.
- Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.759.
- An empirical analysis of memorization in fine-tuned autoregressive language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1816–1826, 2022.
- Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.244. URL https://aclanthology.org/2022.acl-long.244.
- Unsupervised text deidentification. arXiv:2210.11528v1, 2022.
- StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416.
- Stress test evaluation for natural language inference. In E. M. Bender, L. Derczynski, and P. Isabelle, editors, Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 2340–2353. Association for Computational Linguistics, 2018. URL https://aclanthology.org/C18-1198/.
- CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.154. URL https://aclanthology.org/2020.emnlp-main.154.
- Adversarial nli: A new benchmark for natural language understanding. In ACL, 2020.
- Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- OpenAI. ChatGPT. https://chat.openai.com, 2022a.
- OpenAI. GPT documentation. https://platform.openai.com/docs/guides/chat/introduction, 2022b.
- OpenAI. GPT-4 technical report. arXiv, 2023.
- Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4227–4237, Hong Kong, China, Nov. 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1432. URL https://aclanthology.org/D19-1432.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. CoRR, abs/2304.03279, 2023.
- Differentially private in-context learning. arXiv preprint arXiv:2305.01639, 2023.
- E. Parliament. Amendments adopted by the european parliament on 14 june 2023 on the proposal for a regulation of the european parliament and of the council on laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts. https://www.europarl.europa.eu/doceo/document/TA-9-2023-0236_EN.pdf, 2023.
- Bbq: A hand-built bias benchmark for question answering, 2022.
- F. Perez and I. Ribeiro. Ignore previous prompt: Attack techniques for language models. CoRR, abs/2211.09527, 2022.
- Pew Research Center. Majority of latinos say skin color impacts opportunity in america and shapes daily life. 2021. URL https://www.pewresearch.org/hispanic/2021/11/04/majority-of-latinos-say-skin-color-impacts-opportunity-in-america-and-shapes-daily-life/.
- Mind the style of text! adversarial and backdoor attacks based on text style transfer. In EMNLP, 2021a.
- Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In ACL-IJCNLP, 2021b.
- Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models. ArXiv, abs/2307.08487, 2023. URL https://api.semanticscholar.org/CorpusID:259937347.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- Fairness in federated learning via core-stability. Advances in neural information processing systems, 35:5738–5750, 2022.
- L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021.
- Beyond accuracy: Behavioral testing of NLP models with checklist (extended abstract). In Z. Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4824–4828. ijcai.org, 2021. doi: 10.24963/ijcai.2021/659. URL https://doi.org/10.24963/ijcai.2021/659.
- Salon. A racist stereotype is shattered: Study finds white youth are more likely to abuse hard drugs than black youth. https://www.salon.com/2016/04/06/this_racist_stereotype_is_shattered_study_finds_white_youth_are_more_likely_to_abuse_hard_drugs_than_black_youth_partner/, 2016.
- Breeds: Benchmarks for subpopulation shift. International Conference On Learning Representations, 2020.
- Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- Quantifying association capabilities of large language models and its implications on privacy leakage. arXiv preprint arXiv:2305.12707, 2023.
- Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022a.
- Just fine-tune twice: Selective differential privacy for large language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6327–6340, Abu Dhabi, United Arab Emirates, Dec. 2022b. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.425.
- Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv, 2020.
- Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv: Arxiv-2303.11366, 2023.
- Alfworld: Aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR, 2021.
- Prompting GPT-3 to be reliable. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=98p5x51L5af.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.
- I. Solaiman and C. Dennison. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- StabilityAI. StableVicuna: An RLHF Fine-Tune of Vicuna-13B v0. Available at https://github.com/StabilityAI/StableVicuna, 4 2023. URL https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot. DOI:10.57967/hf/0588.
- Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- M. N. Team. Introducing mpt-7b: A new standard for open-source, ly usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-08-19.
- Teen Vogue. The fox–eye trend isn’t cute—it’s racist. https://www.teenvogue.com/story/fox-eye-trend-cultural-appropriation-asian-features, 2020.
- The Human Rights Campaign. Myths about hiv. https://www.hrc.org/resources/debunking-common-myths-about-hiv, 2023.
- J. Thorne and A. Vlachos. Adversarial attacks against fact extraction and verification. CoRR, abs/1903.05543, 2019. URL http://arxiv.org/abs/1903.05543.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv: 2307.09288, 2023c.
- Considerations for differentially private learning with large-scale public pretraining. arXiv:2212.06470, 2022.
- Attention is all you need. In NIPS, 2017.
- S. D. Visco. Yellow peril, red scare: race and communism in national review. Ethnic and Racial Studies, 42(4):626–644, 2019. doi: 10.1080/01419870.2017.1409900. URL https://doi.org/10.1080/01419870.2017.1409900.
- Universal adversarial triggers for attacking and analyzing nlp. In EMNLP, 2019.
- Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019a.
- Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019b.
- T3: tree-autoencoder constrained adversarial text generation for targeted attack. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6134–6150. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.495. URL https://doi.org/10.18653/v1/2020.emnlp-main.495.
- Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/335f5352088d7d9bf74191e006d8e24c-Abstract-round2.html.
- Exploring the limits of domain-adaptive training for detoxifying large-scale language models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022a. URL https://openreview.net/forum?id=v_0F4IZJZw.
- SemAttack: Natural textual attacks via different semantic spaces. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022b.
- InstructRetro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023a.
- Shall we pretrain autoregressive language models with retrieval? A comprehensive study. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023b.
- On the robustness of ChatGPT: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095, 2023c.
- Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023d.
- ChatCAD: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257, 2023e.
- Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022c.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates, Dec. 2022d. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.340.
- Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.16. URL https://aclanthology.org/2020.emnlp-main.16.
- Washington Post. Five stereotypes about poor families and education. https://www.washingtonpost.com/news/answer-sheet/wp/2013/10/28/five-stereotypes-about-poor-families-and-education/, 2013.
- Certifying out-of-domain generalization for blackbox functions. International Conference on Machine Learning, 2022.
- Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022b.
- Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.
- Challenges in detoxifying language models. In Findings of EMNLP, 2021.
- K. Welch. Black criminal stereotypes and racial profiling. Journal of Contemporary Criminal Justice, 23(3):276–288, 2007. doi: 10.1177/1043986207306870. URL https://doi.org/10.1177/1043986207306870.
- Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2020.
- White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights, 2022.
- S. Willison. Prompt injection attacks against GPT-3. http://web.archive.org/web/20220928004736/https://simonwillison.net/2022/Sep/12/prompt-injection/, 2022a.
- S. Willison. I missed this one: Someone did get a prompt leak attack to work against the bot. https://web.archive.org/web/20220924105826/https://twitter.com/simonw/status/1570933190289924096, 2022b.
- Detoxifying language models risks marginalizing minority voices. In NAACL, 2021.
- GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073, 2022a.
- Improving certified robustness via statistical learning with logical reasoning. Advances in Neural Information Processing Systems, 35:34859–34873, 2022b.
- Keep calm and explore: Language models for action generation in text-based games. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
- Ground-truth labels matter: A deeper look into input-label demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2422–2437, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.155.
- Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022.
- Revisiting out-of-distribution robustness in NLP: Benchmark, analysis, and LLMs evaluations. arXiv preprint arXiv:2306.04618, 2023.
- Synthetic text generation with differential privacy: A simple and practical recipe. In ACL, 2023.
- Word-level textual adversarial attacking as combinatorial optimization. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6066–6080. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.540. URL https://doi.org/10.18653/v1/2020.acl-main.540.
- Learning fair representations. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 325–333, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/zemel13.html.
- Counterfactual memorization in neural language models. arXiv preprint arXiv:2112.12938, 2021.
- H. Zhao and G. Gordon. Inherent tradeoffs in learning fair representations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/b4189d9de0fb2b9cce090bd1a15e3420-Paper.pdf.
- Provably confidential language modelling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 943–955, 2022.
- Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint arXiv:2302.10198, 2023.
- Ethical ChatGPT: Concerns, challenges, and commandments. arXiv preprint arXiv:2305.10646, 2023a.
- Navigating the grey area: Expressions of overconfidence and uncertainty in language models. arXiv preprint arXiv:2302.13439, 2023b.
- PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023.
- Exploring AI ethics of ChatGPT: A diagnostic analysis. arXiv preprint arXiv:2301.12867, 2023.
Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li