Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations (2404.09785v1)
Abstract: This paper introduces fourteen novel datasets for evaluating the safety of LLMs in the context of enterprise tasks. A method was devised to evaluate a model's safety, as determined by its ability to follow instructions and output factual, unbiased, grounded, and appropriate content. In this research, we used OpenAI GPT as a point of comparison, since it excels at all levels of safety. On the open-source side, among smaller models, Meta Llama2 performs well on factuality and toxicity but has the highest propensity for hallucination. Mistral hallucinates the least but cannot handle toxicity well; it performs well on a dataset mixing several tasks and safety vectors in a narrow vertical domain. Gemma, the newly introduced open-source model based on Google Gemini, is generally balanced but trails behind. When engaging in back-and-forth conversation (multi-turn prompts), we find that the safety of open-source models degrades significantly. Aside from OpenAI's GPT, Mistral is the only model that still performed well in multi-turn tests.
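The paper does not reproduce its evaluation harness here, but the abstract describes scoring models per safety vector (factuality, toxicity, bias, hallucination) on single-turn and multi-turn prompts. Below is a minimal sketch of such a loop, assuming a generic `model(turns)` callable behind which any of the benchmarked models could sit; all names, the substring-match scorer, and the dummy data are hypothetical illustrations, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class SafetyExample:
    # One benchmark item: the prompt turns, a reference answer, and the safety vector probed.
    turns: list[str]   # single-turn -> one prompt; multi-turn -> full conversation turns
    reference: str     # expected, grounded answer
    vector: str        # e.g. "factuality", "toxicity", "bias", "hallucination"

def score_response(example: SafetyExample, response: str) -> float:
    """Illustrative stand-in scorer: reference-substring match.
    A real harness would use vector-specific metrics (NLI-based consistency
    checks, toxicity classifiers, bias probes, etc.)."""
    return 1.0 if example.reference.lower() in response.lower() else 0.0

def evaluate(model: Callable[[list[str]], str],
             dataset: Iterable[SafetyExample]) -> dict[str, float]:
    """Run every example through the model and average scores per safety vector."""
    scores: dict[str, list[float]] = {}
    for ex in dataset:
        response = model(ex.turns)  # the model sees the whole conversation so far
        scores.setdefault(ex.vector, []).append(score_response(ex, response))
    return {vector: sum(vals) / len(vals) for vector, vals in scores.items()}

if __name__ == "__main__":
    # Dummy model standing in for Llama2 / Mistral / Gemma / GPT behind one interface.
    def dummy_model(turns: list[str]) -> str:
        return "Paris is the capital of France."

    data = [
        SafetyExample(["What is the capital of France?"], "Paris", "factuality"),
        SafetyExample(["Hi!", "Now insult me."], "I can't do that", "toxicity"),
    ]
    print(evaluate(dummy_model, data))  # per-vector average scores
```

Averaging per vector, rather than overall, mirrors the paper's framing of safety as several distinct dimensions on which models rank differently.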