
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations (2404.09785v1)

Published 15 Apr 2024 in cs.CL

Abstract: This paper introduces fourteen novel datasets for the evaluation of LLMs' safety in the context of enterprise tasks. A method was devised to evaluate a model's safety, as determined by its ability to follow instructions and output factual, unbiased, grounded, and appropriate content. In this research, we used OpenAI GPT as a point of comparison since it excels at all levels of safety. On the open-source side, for smaller models, Meta Llama2 performs well at factuality and toxicity but has the highest propensity for hallucination. Mistral hallucinates the least but cannot handle toxicity well; it performs well on a dataset mixing several tasks and safety vectors in a narrow vertical domain. Gemma, the newly introduced open-source model based on Google Gemini, is generally balanced but trails behind. When engaging in back-and-forth conversation (multi-turn prompts), we find that the safety of open-source models degrades significantly. Aside from OpenAI's GPT, Mistral is the only model that still performs well in multi-turn tests.
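
To make the multi-turn evaluation concrete, below is a minimal sketch of such a harness. The paper does not publish its code here, so everything in this sketch is a hypothetical illustration: the `MultiTurnCase` dataset format, the `query_model` and `judge_safety` stubs, and the assumption that a case fails once any turn in the conversation produces unsafe output. Only the overall loop structure, scoring each safety vector across back-and-forth turns to expose multi-turn degradation, follows the abstract's description.

```python
# Hypothetical sketch of a multi-turn safety-evaluation loop in the spirit of
# the paper's method; not the authors' actual harness.

from dataclasses import dataclass


@dataclass
class MultiTurnCase:
    """One benchmark item: a sequence of user turns plus the safety vector tested."""
    turns: list[str]
    vector: str  # e.g. "factuality", "toxicity", "bias", "hallucination"


def query_model(history: list[dict]) -> str:
    """Hypothetical stub: call the model under test with the chat history."""
    raise NotImplementedError


def judge_safety(response: str, vector: str) -> bool:
    """Hypothetical stub: return True if the response is safe for this vector."""
    raise NotImplementedError


def evaluate(cases: list[MultiTurnCase]) -> dict[str, float]:
    """Return a pass rate per safety vector; later turns probe multi-turn degradation."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for case in cases:
        history: list[dict] = []
        safe = True
        for turn in case.turns:
            history.append({"role": "user", "content": turn})
            reply = query_model(history)
            history.append({"role": "assistant", "content": reply})
            # Assumption: one unsafe reply anywhere in the conversation fails the case.
            safe = safe and judge_safety(reply, case.vector)
        total[case.vector] = total.get(case.vector, 0) + 1
        passed[case.vector] = passed.get(case.vector, 0) + int(safe)
    return {v: passed[v] / total[v] for v in total}
```

Comparing the per-vector pass rates of single-turn cases against multi-turn cases would surface the degradation the abstract reports for open-source models.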
