Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores (2403.00553v2)
Abstract: The diversity of outputs generated by LLMs shapes perceptions of their quality and utility. High lexical diversity is often desirable, but there is no standard method for measuring this property. Templated answer structures and ``canned'' responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize the measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release diversity, an open-source Python package for measuring and extracting repetition in text. We also build a platform on top of diversity that lets users interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures -- compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore -- is sufficient to report, as these measures have low mutual correlation with one another.
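To make two of the reported measures concrete, below is a minimal sketch (not the released diversity package's API, whose interface is not described here) of a corpus-level compression ratio and the fraction of long $n$-grams repeated across documents. The function names, tokenization by whitespace, and the choice of $n$ are illustrative assumptions.

```python
import gzip
from collections import Counter

def compression_ratio(texts):
    """Ratio of raw byte length to gzip-compressed length.
    Higher values indicate more redundancy across the corpus."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def repeated_ngram_fraction(texts, n=8):
    """Fraction of each document's n-grams (n=8 is an illustrative choice)
    that also occur in another document -- a rough proxy for
    self-repetition of long n-grams."""
    counts = Counter()
    doc_ngrams = []
    for text in texts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_ngrams.append(ngrams)
        counts.update(ngrams)
    repeated = {g for g, c in counts.items() if c > 1}
    total = sum(len(g) for g in doc_ngrams)
    return (sum(len(g & repeated) for g in doc_ngrams) / total) if total else 0.0

# Toy corpus of model outputs with a "canned" opener shared across documents.
outputs = [
    "As an AI language model, I cannot provide medical advice.",
    "As an AI language model, I cannot provide legal advice.",
    "The Eiffel Tower is located in Paris, France.",
]
print(compression_ratio(outputs))          # higher -> more templated/repetitive
print(repeated_ngram_fraction(outputs, 5)) # share of repeated 5-grams
```

The gzip ratio stands in for the "fast compression" family of scores, while the repeated $n$-gram fraction stands in for the slower overlap-based homogeneity measures the abstract compares it against.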