
Whose LLM is it Anyway? Linguistic Comparison and LLM Attribution for GPT-3.5, GPT-4 and Bard (2402.14533v1)

Published 22 Feb 2024 in cs.CL

Abstract: LLMs are capable of generating text that is similar to or surpasses human quality. However, it is unclear whether LLMs tend to exhibit distinctive linguistic styles akin to how human authors do. Through a comprehensive linguistic analysis, we compare the vocabulary, Part-Of-Speech (POS) distribution, dependency distribution, and sentiment of texts generated by three of the most popular LLMs today (GPT-3.5, GPT-4, and Bard) to diverse inputs. The results point to significant linguistic variations which, in turn, enable us to attribute a given text to its LLM origin with a favorable 88% accuracy using a simple off-the-shelf classification model. Theoretical and practical implications of this intriguing finding are discussed.


Summary

  • The paper attributes LLM outputs to their source model with 88% accuracy by analyzing distinctive linguistic markers such as vocabulary usage and POS distribution.
  • It reveals key differences in lexical diversity and dependency relations, with GPT-4 showing higher lexical density and Bard exhibiting a broader POS range.
  • The study leverages statistical tests and an XGBoost model to provide practical insights for enhancing LLM evaluation and detection methodologies.

Linguistic Comparison and Attribution of LLMs: GPT-3.5, GPT-4, and Bard

Introduction

Recent developments in LLMs such as GPT-3.5, GPT-4, and Bard have advanced the field of NLP through human-like text generation. Despite their capabilities, there remains uncertainty regarding each model's distinctive linguistic style, a characteristic often overlooked in comparative analyses of LLMs. This paper provides a comprehensive analysis of linguistic markers, specifically vocabulary, part-of-speech (POS) distribution, dependency relations, and sentiment, across these LLMs. The findings reveal significant linguistic variations that enable attributing a given text to its LLM of origin with 88% accuracy.

Methodology

Data Collection

The study uses the LLM Comparison Corpus (LC2), derived from the Human ChatGPT Comparison Corpus (HC3) and comprising 5,000 unique prompts. Each prompt is answered by GPT-3.5, GPT-4, and Bard, yielding 15,000 LLM responses in total.

Analytical Framework

The analysis focuses on four linguistic dimensions: vocabulary usage, POS distribution, dependency relations, and sentiment. Statistical tests, including ANOVA and Kolmogorov-Smirnov tests, are used to establish differences between models. An XGBoost model then assesses how reliably texts can be attributed to their LLM source using these linguistic features.
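A minimal sketch of how such per-feature comparisons might look, assuming each response has already been reduced to numeric feature values; the feature name and values below are illustrative placeholders, not the paper's data:

```python
# Sketch: compare one linguistic feature (e.g., noun rate) across the three LLMs.
# The values are illustrative placeholders, not taken from the paper.
from scipy import stats

noun_rate = {
    "gpt-3.5": [0.21, 0.24, 0.22, 0.25],   # per-response noun proportions
    "gpt-4":   [0.23, 0.26, 0.27, 0.24],
    "bard":    [0.18, 0.20, 0.19, 0.21],
}

# One-way ANOVA: does the mean noun rate differ across models?
f_stat, p_anova = stats.f_oneway(*noun_rate.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# Pairwise Kolmogorov-Smirnov tests: do the per-model distributions differ?
pairs = [("gpt-3.5", "gpt-4"), ("gpt-3.5", "bard"), ("gpt-4", "bard")]
for a, b in pairs:
    ks_stat, p_ks = stats.ks_2samp(noun_rate[a], noun_rate[b])
    print(f"KS {a} vs {b}: D={ks_stat:.2f}, p={p_ks:.4f}")
```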

Results

Vocabulary and POS

The analysis indicates that Bard typically generates shorter responses with less diverse vocabulary compared to GPT-3.5 and GPT-4. In contrast, GPT-4 demonstrates higher lexical density (Figure 1).

Figure 1: Part-of-Speech (POS) distribution comparison between GPT-3.5, GPT-4, and Bard.

The POS distribution analysis reveals notable distinctions, especially in nouns and adjectives. GPT-3.5 uses nouns more frequently, while GPT-4 shows a higher prevalence of proper nouns. Bard, however, diverges significantly by employing a broader range of POS categories.
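The summary does not name the NLP toolkit used for these counts, so the sketch below assumes spaCy as a stand-in for computing per-response lexical density and a coarse POS distribution:

```python
# Sketch: lexical density and POS distribution for one response, using spaCy
# (an assumed toolkit; requires the en_core_web_sm model to be installed).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_profile(text: str) -> tuple[float, dict[str, float]]:
    doc = nlp(text)
    words = [t for t in doc if not t.is_punct and not t.is_space]
    # Lexical density: share of content words (nouns, verbs, adjectives, adverbs).
    content = [t for t in words if t.pos_ in {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}]
    density = len(content) / len(words) if words else 0.0
    # Normalized coarse POS distribution over the remaining tokens.
    counts = Counter(t.pos_ for t in words)
    dist = {pos: n / len(words) for pos, n in counts.items()}
    return density, dist

density, dist = pos_profile("Bard typically writes shorter answers than GPT-4.")
print(f"lexical density: {density:.2f}")
print(sorted(dist.items(), key=lambda kv: -kv[1]))
```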

Dependency Relations

The paper examines key dependency relations, underscoring Bard's distinctive linguistic style, marked by heavier use of frequent dependencies such as punctuation and auxiliaries. In contrast, GPT-3.5 and GPT-4 differ only minimally in this respect (Figure 2).

Figure 2: Top-30 dependency relations comparison between GPT-3.5, GPT-4, and Bard.
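A companion sketch, under the same spaCy assumption, for extracting the dependency-relation frequencies compared in Figure 2:

```python
# Sketch: frequency of dependency relations in a response, again assuming spaCy.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_distribution(text: str) -> dict[str, float]:
    doc = nlp(text)
    counts = Counter(token.dep_ for token in doc)  # includes "punct", "aux", etc.
    total = sum(counts.values())
    return {dep: n / total for dep, n in counts.items()}

dist = dep_distribution("The model, however, relies heavily on auxiliaries and punctuation.")
for dep, share in sorted(dist.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{dep:10s} {share:.2f}")
```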

Sentiment Analysis

Sentiment analysis reveals a consistent inclination toward positive sentiment across all LLMs, with no significant inter-model variation (Figure 3).

Figure 3: Sentiment distribution over the responses generated by GPT-3.5, GPT-4, and Bard.
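The specific sentiment scorer is not identified in this summary; the sketch below uses NLTK's VADER as an illustrative stand-in for per-response polarity scoring:

```python
# Sketch: per-response sentiment scoring with VADER (an illustrative stand-in,
# not necessarily the tool used in the paper).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

responses = {
    "gpt-4": "This is a wonderfully clear and helpful explanation.",
    "bard":  "The answer is fine, though somewhat brief.",
}
for model, text in responses.items():
    scores = sia.polarity_scores(text)  # neg / neu / pos / compound
    print(model, scores["compound"])
```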

LLM Attribution

Using the identified linguistic features, the XGBoost model attributes texts to their source LLM with 88% accuracy. Feature importance analysis highlights noun and proper noun occurrences, positive sentiment, and vocabulary density as the strongest signals (Figure 4).

Figure 4: Feature importance of the top-10 features of the XGBoost model.
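A hedged sketch of the attribution setup: an off-the-shelf XGBoost classifier trained on per-response linguistic features, with feature importances read off afterwards. The feature names and synthetic data are illustrative only and do not reproduce the paper's 88% result:

```python
# Sketch: attributing responses to their source LLM with an off-the-shelf
# XGBoost classifier over linguistic features. Features and data are synthetic
# placeholders standing in for real per-response measurements.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_per_model = 200
feature_names = ["noun_rate", "propn_rate", "pos_sentiment", "lexical_density"]

# Synthetic stand-in features for the three models.
X = np.vstack([
    rng.normal(loc=[0.22, 0.03, 0.60, 0.52], scale=0.03, size=(n_per_model, 4)),  # GPT-3.5
    rng.normal(loc=[0.20, 0.05, 0.62, 0.56], scale=0.03, size=(n_per_model, 4)),  # GPT-4
    rng.normal(loc=[0.18, 0.04, 0.58, 0.48], scale=0.03, size=(n_per_model, 4)),  # Bard
])
y = np.repeat([0, 1, 2], n_per_model)  # 0 = GPT-3.5, 1 = GPT-4, 2 = Bard

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="mlogloss")
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Feature importances, analogous to the top-10 importance plot in Figure 4.
for name, imp in sorted(zip(feature_names, clf.feature_importances_), key=lambda kv: -kv[1]):
    print(f"{name:16s} {imp:.3f}")
```

On real features extracted from LC2-style responses, the same pipeline would yield a multi-class attribution accuracy and an importance ranking analogous to Figure 4.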

Discussion

The research affirms the presence of distinctive linguistic styles among LLMs, akin to human authors. This finding has implications for both theoretical understanding and practical applications, such as LLM evaluation and detection. The findings advocate for leveraging LLM-specific linguistic markers to enhance model detection accuracy and mitigate attribution challenges.

Conclusion

This paper underscores significant linguistic differences among popular LLMs, facilitating accurate attribution. These insights could inform future improvements in LLM design, enhancing their adaptability and application in varied linguistic contexts. The research opens avenues for further exploration of cross-linguistic and contextual variability in LLM outputs.
