
Exploring Precision and Recall to assess the quality and diversity of LLMs (2402.10693v3)

Published 16 Feb 2024 in cs.CL and cs.LG

Abstract: We introduce a novel evaluation framework for LLMs such as Llama-2 and Mistral, focusing on importing Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art LLMs, the study reveals new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current LLMs face in generating diverse and high-quality text. We release our code and data.

Authors (5)
  1. Florian Le Bronnec (4 papers)
  2. Alexandre Verine (6 papers)
  3. Benjamin Negrevergne (20 papers)
  4. Yann Chevaleyre (28 papers)
  5. Alexandre Allauzen (26 papers)
Citations (8)