Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores (2403.00553v2)

Published 1 Mar 2024 in cs.CL

Abstract: The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and "canned" responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release diversity, an open-source Python package for measuring and extracting repetition in text. We also build a platform based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures -- compression ratios, self-repetition of long $n$-grams, and Self-BLEU and BERTScore -- are sufficient to report, as they have low mutual correlation with each other.


Summary

  • The paper presents a novel, open-source tool that standardizes text diversity measurement in LLM outputs.
  • It empirically compares metrics such as compression ratios, n-gram self-repetition, Self-BLEU, and BERTScore to assess their convergent validity and effectiveness.
  • The findings offer actionable insights for enhancing language model diversity and improving dataset evaluation.

Standardizing the Measurement of Text Diversity: A Comprehensive Tool and Analysis

Overview of Diversity Measurement in LLMs

The importance of diversity in the outputs generated by LLMs cannot be overstated; it shapes both the perceived quality and the practical utility of these models. This paper presents an empirical investigation into the measurement of text diversity, focusing on English texts generated by a variety of models. It highlights a key limitation of current practice: no standardized score exists for quantifying diversity, which leads to inconsistencies in how model performance is evaluated and reported. By comparing a range of diversity scores and introducing an open-source tool, this work seeks to address these gaps and offers a unified framework for future research and application in the field.
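To make the notion of a lexical-diversity score concrete, the sketch below computes the moving-average type-token ratio (MATTR), one of the classic measures this line of work draws on. It assumes simple whitespace tokenization and a fixed window size, and it illustrates the general idea rather than the interface of the released diversity package.

```python
def mattr(tokens, window=100):
    """Moving-average type-token ratio: the mean type-token ratio over a
    sliding window, which reduces raw TTR's sensitivity to text length."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

# A highly repetitive text scores much lower than a varied one.
repetitive = ("the quick brown fox jumps over the lazy dog " * 20).split()
print(round(mattr(repetitive, window=50), 3))
```

Because the raw type-token ratio inevitably falls as a text grows longer, averaging it over fixed-size windows keeps scores roughly comparable across documents of different lengths, which is exactly the kind of confound a standardized toolkit needs to handle.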

Identifying Effective Diversity Scores

A critical contribution of this research is the comparative analysis of existing methods for measuring text diversity, including compression ratios, self-repetition of n-grams, Self-BLEU, and BERTScore, among others. The analysis shows that fast compression ratios capture much of the same information as slow-to-compute n-gram overlap homogeneity scores. The findings further suggest that a small combination of measures, namely compression ratios, self-repetition of long n-grams, Self-BLEU, and BERTScore, is sufficient to report, because these scores have low mutual correlation with one another. This multifaceted approach to scoring is intended to support a more accurate and comprehensive assessment of LLM outputs.
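As an illustration of the two cheapest signals named above, the sketch below computes a corpus-level zlib compression ratio and a rough document-level n-gram self-repetition rate. The whitespace tokenization and the exact self-repetition definition here are simplifying assumptions and may differ from the scores implemented in the paper's diversity package.

```python
import zlib
from collections import Counter

def compression_ratio(texts):
    """Raw byte length divided by zlib-compressed length for the whole
    corpus. Higher values mean more redundancy, i.e. lower diversity."""
    blob = "\n".join(texts).encode("utf-8")
    return len(blob) / len(zlib.compress(blob, level=9))

def ngram_self_repetition(texts, n=4):
    """Average fraction of a document's n-grams that also occur in at
    least one other document in the corpus (a rough homogeneity signal)."""
    doc_ngrams = []
    for text in texts:
        toks = text.split()
        doc_ngrams.append(set(zip(*[toks[i:] for i in range(n)])))
    doc_freq = Counter(g for grams in doc_ngrams for g in grams)
    per_doc = [
        sum(doc_freq[g] > 1 for g in grams) / len(grams)
        for grams in doc_ngrams if grams
    ]
    return sum(per_doc) / len(per_doc) if per_doc else 0.0

outputs = [
    "As an AI language model, I cannot answer that question.",
    "As an AI language model, I can summarize the document for you.",
    "The mitochondria is the powerhouse of the cell.",
]
print(compression_ratio(outputs), ngram_self_repetition(outputs, n=4))
```

On templated corpora such as the canned "As an AI language model..." openings above, both numbers rise together, which is consistent with the paper's finding that fast compression ratios capture information similar to slower n-gram overlap scores.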

Practical Applications and Theoretical Implications

The implications of this work are twofold. Practically, the diversity scoring tool released as part of this paper offers a standardized method for evaluating text diversity, with potential applications extending beyond LLM output analysis to areas like instruction-tuning datasets and human-produced texts. Theoretically, the insights gained from this comparative analysis contribute to a deeper understanding of how diversity in text can be quantified and optimized, potentially driving advancements in model development and training methodologies.

Future Prospects in AI and LLM Development

Looking forward, this research opens several avenues for further exploration. The identified scores provide a foundation for developing models that produce more diverse, high-quality text. The work also highlights the confounding relationship between text length and diversity scores, suggesting that future assessments should account for length explicitly. Finally, by standardizing the measurement of text diversity, it may catalyze new research aimed at enhancing the creativity and variability of LLMs.

Conclusions

In conclusion, this paper makes significant strides towards standardizing the measurement of text diversity within the field of LLMs. By empirically analyzing and comparing different diversity scores, releasing a comprehensive tool for diversity evaluation, and discussing the broader implications of this work, it sets a new standard for future research on LLMs and generative AI. As the field continues to evolve, the methodologies and insights presented in this paper will undoubtedly play a crucial role in shaping the development of more diverse and capable LLMs.