
Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study (2402.15518v2)

Published 11 Feb 2024 in cs.CL

Abstract: The performance of conversational LLMs in general, and of ChatGPT in particular, is currently being evaluated on many different tasks, from logical reasoning and maths to answering questions on a myriad of topics. In contrast, much less attention is being devoted to the linguistic features of the texts these LLMs generate. This is surprising, since LLMs are models of language and understanding how they use language is important. Indeed, conversational LLMs are poised to have a significant impact on the evolution of languages, as they may eventually dominate the creation of new text. For example, if conversational LLMs do not use a word, it may become less and less frequent and eventually stop being used altogether. Therefore, evaluating the linguistic features of the text they produce, and how those features depend on the model parameters, is the first step toward understanding the potential impact of conversational LLMs on the evolution of languages. In this paper, we consider the evaluation of the lexical richness of the text generated by LLMs and how it depends on the model parameters. A methodology is presented and used to conduct a comprehensive evaluation of lexical richness using ChatGPT as a case study. The results show how lexical richness depends on the version of ChatGPT and on some of its parameters, such as the presence penalty, as well as on the role assigned to the model. The dataset and tools used in our analysis are released under open licenses with the goal of drawing much-needed attention to the evaluation of the linguistic features of LLM-generated text.
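To make "lexical richness" concrete, the sketch below computes two widely used diversity measures: the type-token ratio (TTR) and its moving-average variant (MATTR). It is an illustration only. The abstract does not state which metrics or tools the paper uses, and the function names, the regex tokenizer, and the default window size are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct tokens divided by total tokens; 0.0 for empty input."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=50):
    """Moving-average TTR: the mean TTR over every sliding window of `window` tokens.

    Falls back to the plain TTR when the text is shorter than one window.
    The default window of 50 is an illustrative choice, not a value from the paper.
    """
    if len(tokens) < window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = tokenize("the cat sat on the mat")
    print(type_token_ratio(sample))
    print(mattr(sample, window=5))
```

Plain TTR tends to fall as a text gets longer, because common words repeat. Averaging over fixed-size windows makes scores more comparable across responses of different lengths, which matters when comparing outputs from different model versions or parameter settings.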

Authors (5)
  1. Gonzalo Martínez (18 papers)
  2. José Alberto Hernández (25 papers)
  3. Javier Conde (28 papers)
  4. Pedro Reviriego (36 papers)
  5. Elena Merino (1 paper)