Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
51 tokens/sec
GPT-4o
60 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
8 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models (2306.11507v1)

Published 20 Jun 2023 in cs.CL and cs.AI
TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models

Abstract: LLMs such as ChatGPT, have gained significant attention due to their impressive natural language processing capabilities. It is crucial to prioritize human-centered principles when utilizing these models. Safeguarding the ethical and moral compliance of LLMs is of utmost importance. However, individual ethical issues have not been well studied on the latest LLMs. Therefore, this study aims to address these gaps by introducing a new benchmark -- TrustGPT. TrustGPT provides a comprehensive evaluation of LLMs in three crucial areas: toxicity, bias, and value-alignment. Initially, TrustGPT examines toxicity in LLMs by employing toxic prompt templates derived from social norms. It then quantifies the extent of bias in models by measuring quantifiable toxicity values across different groups. Lastly, TrustGPT assesses the value of conversation generation models from both active value-alignment and passive value-alignment tasks. Through the implementation of TrustGPT, this research aims to enhance our understanding of the performance of conversation generation models and promote the development of LLMs that are more ethical and socially responsible.

TRUST GPT: Benchmarking Ethical Considerations in LLMs

Introduction to TRUST GPT

The evolution of LLMs has introduced complex challenges in ensuring their ethical and responsible use. TRUST GPT emerges as a tailored benchmark that rigorously evaluates the ethical dimensions of LLMs, focusing on toxicity, bias, and value-alignment. It aims to illuminate the ethical intricacies of cutting-edge models such as ChatGPT, highlighting the critical areas requiring intervention to foster the development of more ethically aligned LLMs. TRUST GPT's comprehensive analysis leverages an empirical approach to scrutinize eight recent LLMs, uncovering significant ethical concerns that necessitate further rectification.

Methodology and Design

Toxicity Examination

TRUST GPT explores the generation of toxic content by prompting LLMs with scenarios reflecting diverse social norms. Employing PERSPECTIVE API, it quantitatively assesses the toxicity levels of responses, attempting to penetrate the models' RLHF-trained barriers to revealing underlying toxic potentials.

Bias Analysis

The benchmark probes into the model's bias by generating responses across different demographic markers, using three key metrics: average toxicity score, standard deviation across demographics, and statistical significance via the Mann-Whitney U test. This multi-dimensional approach seeks to unravel the nuanced biases encapsulated within these sophisticated models.

Value-Alignment Evaluation

TRUST GPT stratifies value-alignment into active (AVA) and passive (PVA) categories. AVA tests the model's ethical judgment, requiring it to choose among predefined moral alignments. PVA examines the model's response to prompts conflicting with societal norms, evaluating its propensity to refute engagement.

Empirical Findings and Discourse

The results from applying TRUST GPT on selected models unveil a nuanced landscape of ethical comportments. Although advancements in RLHF techniques have mitigated toxicity to some extent, notable concerns linger, especially under carefully crafted prompts. Bias detection reveals variable sensitivities across demographics, underscoring the intricate balance needed in model training to avoid stereotypical inclinations. In value-alignment tasks, the benchmark highlights a discrepancy in models' ability to actively make ethical judgments and their resilience against generating content from ethically controversial prompts.

Implications and Forward-Look

TRUST GPT's insights draw attention to the imperative need for continuous, intricate scrutiny of ethical aspects in LLMs' development trajectory. The identification of toxicity and bias underscores the necessity for enhanced mitigation strategies, possibly incorporating broader human feedback and diverse datasets in RLHF cycles. Value-alignment outcomes advocate for sophisticated model training that encapsulates a wider spectrum of ethical reasoning abilities.

The paper predicates a future wherein benchmarks like TRUST GPT play a pivotal role in shaping the ethical contours of LLM development, instigating a paradigm where models not only excel in linguistic prowess but also in embodying societal values and norms. It sets a precedent for subsequent research to build upon, aiming for LLMs that are not merely technologically advanced but also ethically conscientious.

Concluding Remarks

The endeavors encapsulated in TRUST GPT serve as a stepping stone towards realizing LLMs that align closely with human ethical standards. The benchmark opens avenues for constructive discourse and enhancements in modeling practices, aspiring for a future where LLMs seamlessly integrate into the societal fabric, championing both innovation and ethical integrity.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (61)
  1. OpenAI. Chatgpt, 2023a. https://openai.com/product/chatgpt.
  2. OpenAI. Gpt-4, 2023b. https://openai.com/product/gpt-4.
  3. Llama: Open and efficient foundation language models, 2023.
  4. Stanford alpaca: An instruction-following llama model, 2023. https://github.com/tatsu-lab/stanford_alpaca.
  5. vicuna, 2023. https://lmsys.org/blog/2023-03-30-vicuna/.
  6. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173, 2023.
  7. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.
  8. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
  9. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
  10. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
  11. Toxicity detection with generative prompt-based inference. arXiv preprint arXiv:2205.12390, 2022.
  12. Probing toxic content in large pre-trained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4262–4274, 2021.
  13. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061, 2022.
  14. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023.
  15. Biasasker: Measuring the bias in conversational ai system. arXiv preprint arXiv:2305.12434, 2023.
  16. Unified detoxifying and debiasing in language generation via inference-time adaptive optimization. arXiv preprint arXiv:2210.04492, 2022.
  17. Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035, 2019.
  18. Mitigating political bias in language models through reinforced calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14857–14866, 2021.
  19. Mitigating gender bias in distilled language models via counterfactual role reversal. arXiv preprint arXiv:2203.12574, 2022.
  20. Gender bias in neural natural language processing. Logic, Language, and Security: Essays Dedicated to Andre Scedrov on the Occasion of His 65th Birthday, pages 189–202, 2020.
  21. Auto-debias: Debiasing masked language models with automated biased prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1012–1023, 2022.
  22. Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891, 2019.
  23. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
  24. Exploring ai ethics of chatgpt: A diagnostic analysis. arXiv preprint arXiv:2301.12867, 2023.
  25. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  26. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  27. Social chemistry 101: Learning to reason about social and moral norms. arXiv preprint arXiv:2011.00620, 2020.
  28. Measuring and reducing gendered correlations in pre-trained models. arXiv preprint arXiv:2010.06032, 2020.
  29. Crows-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133, 2020.
  30. Measuring bias in contextualized word representations. arXiv preprint arXiv:1906.07337, 2019.
  31. On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561, 2019.
  32. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456, 2020.
  33. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50–60, 1947.
  34. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023.
  35. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  36. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
  37. Document-level machine translation with large language models, 2023.
  38. Semantic compression with large language models, 2023.
  39. J. Manyika. an early experiment with generative ai, 2023. https://bard.google.com/.
  40. Palm: Scaling language modeling with pathways, 2022.
  41. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021.
  42. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 862–872, 2021.
  43. Can machines learn morality? the delphi experiment. arXiv e-prints, pages arXiv–2110, 2021.
  44. “i’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9180–9211, 2022.
  45. Unqovering stereotyping biases via underspecified questions. arXiv preprint arXiv:2010.02428, 2020.
  46. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.
  47. Towards identifying social bias in dialog systems: Framework, dataset, and benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3576–3591, 2022.
  48. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  49. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  50. Enabling classifiers to make judgements explicitly aligned with human values. arXiv preprint arXiv:2210.07652, 2022.
  51. The FastChat developers. Fastchat-t5: a chat assistant fine-tuned from flan-t5 by lmsys, 2023. https://github.com/lm-sys/FastChat#FastChat-T5.
  52. Glm-130b: An open bilingual pre-trained model, 2022.
  53. an open assistant for everyone by laion, 2023. https://open-assistant.io/.
  54. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
  55. Wikipedia about social norm, 2023. https://en.wikipedia.org/wiki/Social_norm.
  56. Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
  57. Siméon-Denis Poisson. Recherches sur la probabilité des jugements en matière criminelle et en matière civile: précédées des règles générales du calcul des probabilités. Bachelier, 1837.
  58. Is information extraction solved by chatgpt? an analysis of performance, evaluation criteria, robustness and errors. arXiv preprint arXiv:2305.14450, 2023.
  59. Measuring fairness with biased rulers: A survey on quantifying biases in pretrained language models. arXiv preprint arXiv:2112.07447, 2021.
  60. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
  61. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Yue Huang (171 papers)
  2. Qihui Zhang (13 papers)
  3. Philip S. Y (1 paper)
  4. Lichao Sun (186 papers)
Citations (34)