
Whose LLM is it Anyway? Linguistic Comparison and LLM Attribution for GPT-3.5, GPT-4 and Bard (2402.14533v1)

Published 22 Feb 2024 in cs.CL

Abstract: LLMs are capable of generating text that is similar to or surpasses human quality. However, it is unclear whether LLMs tend to exhibit distinctive linguistic styles akin to how human authors do. Through a comprehensive linguistic analysis, we compare the vocabulary, Part-Of-Speech (POS) distribution, dependency distribution, and sentiment of texts generated by three of the most popular LLMs today (GPT-3.5, GPT-4, and Bard) to diverse inputs. The results point to significant linguistic variations which, in turn, enable us to attribute a given text to its LLM origin with a favorable 88% accuracy using a simple off-the-shelf classification model. Theoretical and practical implications of this intriguing finding are discussed.


Summary

  • The paper attributes LLM outputs to their source model with 88% accuracy by analyzing distinctive linguistic markers such as vocabulary usage and POS distribution.
  • It reveals key differences in lexical diversity and dependency relations, with GPT-4 showing higher lexical density and Bard exhibiting a broader POS range.
  • The study leverages statistical tests and an XGBoost model to provide practical insights for enhancing LLM evaluation and detection methodologies.

Linguistic Comparison and Attribution of LLMs: GPT-3.5, GPT-4, and Bard

Introduction

Recent developments in LLMs such as GPT-3.5, GPT-4, and Bard have advanced the field of NLP through human-like text generation. Despite their capabilities, there remains uncertainty regarding each model's distinctive linguistic style, a characteristic often overlooked in comparative analyses of LLMs. This paper provides a comprehensive analysis of linguistic markers, specifically vocabulary, part-of-speech (POS) distribution, dependency relations, and sentiment, across these LLMs. The findings reveal significant linguistic variations that enable attributing a given text to its LLM of origin with 88% accuracy.

Methodology

Data Collection

The study uses the LLM Comparison Corpus (LC2), derived from the Human ChatGPT Comparison Corpus (HC3) and comprising 5,000 unique prompts. Each prompt is answered by GPT-3.5, GPT-4, and Bard, yielding 15,000 LLM responses in total.

Analytical Framework

The analysis focuses on four linguistic dimensions: vocabulary usage, POS distribution, dependency relations, and sentiment. Statistical tests, including ANOVA and Kolmogorov-Smirnov tests, are used to establish differences between models. An XGBoost model then assesses how reliably texts can be attributed to their LLM source using these linguistic features.
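A minimal sketch of how such per-feature comparisons might look, assuming each response has already been reduced to numeric feature values; the feature name and values below are illustrative placeholders, not the paper's data:

```python
# Sketch: compare one linguistic feature (e.g., noun rate) across the three LLMs.
# The values are illustrative placeholders, not taken from the paper.
from scipy import stats

noun_rate = {
    "gpt-3.5": [0.21, 0.24, 0.22, 0.25],   # per-response noun proportions
    "gpt-4":   [0.23, 0.26, 0.27, 0.24],
    "bard":    [0.18, 0.20, 0.19, 0.21],
}

# One-way ANOVA: does the mean noun rate differ across models?
f_stat, p_anova = stats.f_oneway(*noun_rate.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# Pairwise Kolmogorov-Smirnov tests: do the per-model distributions differ?
pairs = [("gpt-3.5", "gpt-4"), ("gpt-3.5", "bard"), ("gpt-4", "bard")]
for a, b in pairs:
    ks_stat, p_ks = stats.ks_2samp(noun_rate[a], noun_rate[b])
    print(f"KS {a} vs {b}: D={ks_stat:.2f}, p={p_ks:.4f}")
```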

Results

Vocabulary and POS

The analysis indicates that Bard typically generates shorter responses with less diverse vocabulary compared to GPT-3.5 and GPT-4. In contrast, GPT-4 demonstrates higher lexical density (Figure 1).

Figure 1: Part-of-Speech (POS) distribution comparison between GPT-3.5, GPT-4, and Bard.

The POS distribution analysis reveals notable distinctions, especially in nouns and adjectives. GPT-3.5 uses nouns more frequently, while GPT-4 shows a higher prevalence of proper nouns. Bard, however, diverges significantly by employing a broader range of POS categories.
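The summary does not name the NLP toolkit used for these counts, so the sketch below assumes spaCy as a stand-in for computing per-response lexical density and a coarse POS distribution:

```python
# Sketch: lexical density and POS distribution for one response, using spaCy
# (an assumed toolkit; requires the en_core_web_sm model to be installed).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_profile(text: str) -> tuple[float, dict[str, float]]:
    doc = nlp(text)
    words = [t for t in doc if not t.is_punct and not t.is_space]
    # Lexical density: share of content words (nouns, verbs, adjectives, adverbs).
    content = [t for t in words if t.pos_ in {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}]
    density = len(content) / len(words) if words else 0.0
    # Normalized coarse POS distribution over the remaining tokens.
    counts = Counter(t.pos_ for t in words)
    dist = {pos: n / len(words) for pos, n in counts.items()}
    return density, dist

density, dist = pos_profile("Bard typically writes shorter answers than GPT-4.")
print(f"lexical density: {density:.2f}")
print(sorted(dist.items(), key=lambda kv: -kv[1]))
```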

Dependency Relations

The paper examines key dependency relations, underscoring Bard's distinctive linguistic style, marked by heavier use of frequent dependencies such as punctuation and auxiliaries. In contrast, GPT-3.5 and GPT-4 differ only minimally in this respect (Figure 2).

Figure 2: Top-30 dependency relations comparison between GPT-3.5, GPT-4, and Bard.
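A companion sketch, under the same spaCy assumption, for extracting the dependency-relation frequencies compared in Figure 2:

```python
# Sketch: frequency of dependency relations in a response, again assuming spaCy.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_distribution(text: str) -> dict[str, float]:
    doc = nlp(text)
    counts = Counter(token.dep_ for token in doc)  # includes "punct", "aux", etc.
    total = sum(counts.values())
    return {dep: n / total for dep, n in counts.items()}

dist = dep_distribution("The model, however, relies heavily on auxiliaries and punctuation.")
for dep, share in sorted(dist.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{dep:10s} {share:.2f}")
```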

Sentiment Analysis

Sentiment analysis reveals a consistent inclination toward positive sentiment across all LLMs, with no significant inter-model variation (Figure 3).

Figure 3: Sentiment distribution over the responses generated by GPT-3.5, GPT-4, and Bard.
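The specific sentiment scorer is not identified in this summary; the sketch below uses NLTK's VADER as an illustrative stand-in for per-response polarity scoring:

```python
# Sketch: per-response sentiment scoring with VADER (an illustrative stand-in,
# not necessarily the tool used in the paper).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

responses = {
    "gpt-4": "This is a wonderfully clear and helpful explanation.",
    "bard":  "The answer is fine, though somewhat brief.",
}
for model, text in responses.items():
    scores = sia.polarity_scores(text)  # neg / neu / pos / compound
    print(model, scores["compound"])
```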

LLM Attribution

Using the identified linguistic features, the XGBoost model attributes texts to their source LLM with 88% accuracy. Feature importance analysis highlights noun and proper noun occurrences, positive sentiment, and vocabulary density as the strongest signals (Figure 4).

Figure 4: Feature importance of the top-10 features of the XGBoost model.
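A hedged sketch of the attribution setup: an off-the-shelf XGBoost classifier trained on per-response linguistic features, with feature importances read off afterwards. The feature names and synthetic data are illustrative only and do not reproduce the paper's 88% result:

```python
# Sketch: attributing responses to their source LLM with an off-the-shelf
# XGBoost classifier over linguistic features. Features and data are synthetic
# placeholders standing in for real per-response measurements.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_per_model = 200
feature_names = ["noun_rate", "propn_rate", "pos_sentiment", "lexical_density"]

# Synthetic stand-in features for the three models.
X = np.vstack([
    rng.normal(loc=[0.22, 0.03, 0.60, 0.52], scale=0.03, size=(n_per_model, 4)),  # GPT-3.5
    rng.normal(loc=[0.20, 0.05, 0.62, 0.56], scale=0.03, size=(n_per_model, 4)),  # GPT-4
    rng.normal(loc=[0.18, 0.04, 0.58, 0.48], scale=0.03, size=(n_per_model, 4)),  # Bard
])
y = np.repeat([0, 1, 2], n_per_model)  # 0 = GPT-3.5, 1 = GPT-4, 2 = Bard

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="mlogloss")
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Feature importances, analogous to the top-10 importance plot in Figure 4.
for name, imp in sorted(zip(feature_names, clf.feature_importances_), key=lambda kv: -kv[1]):
    print(f"{name:16s} {imp:.3f}")
```

On real features extracted from LC2-style responses, the same pipeline would yield a multi-class attribution accuracy and an importance ranking analogous to Figure 4.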

Discussion

The research affirms the presence of distinctive linguistic styles among LLMs, akin to human authors. This finding has implications for both theoretical understanding and practical applications, such as LLM evaluation and detection. The findings advocate for leveraging LLM-specific linguistic markers to enhance model detection accuracy and mitigate attribution challenges.

Conclusion

This paper underscores significant linguistic differences among popular LLMs, facilitating accurate attribution. These insights could inform future improvements in LLM design, enhancing their adaptability and application in varied linguistic contexts. The research opens avenues for further exploration of cross-linguistic and contextual variability in LLM outputs.
