This paper investigates the coevolutionary relationship between humans and LLMs by analyzing word frequency trends in arXiv paper abstracts (Geng et al., 13 Feb 2025). The core finding is that the usage frequency of certain words previously identified as characteristic of LLM output (e.g., "delve", "intricate", "realm", "showcasing") dropped noticeably starting around April 2024. This timing coincides with publications that flagged these specific words as overused by LLMs such as ChatGPT. Conversely, the frequency of other LLM-favored words, such as "significant", has continued to increase.
The analysis uses metadata and abstracts from over 1.29 million arXiv papers submitted between 2018 and 2024, sourced from a Kaggle dataset, along with data on withdrawn arXiv papers from the WithdrarXiv dataset. Word frequencies were computed monthly and normalized per 10,000 abstracts. The authors compare trends across all disciplines, and specifically between Computer Science (cs) and other fields, noting that words becoming more frequent in cs abstracts also increased in other disciplines, suggesting widespread LLM influence.
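The normalization step is straightforward to reproduce. Below is a minimal sketch (not the authors' code), assuming the Kaggle metadata has been loaded into a pandas DataFrame `df` with illustrative columns `abstract` (text) and `created` (submission timestamp):

```python
# Monthly occurrence rate of target words, normalized per 10,000 abstracts.
# Column names and the word list are illustrative assumptions.
import re
import pandas as pd

TARGET_WORDS = ["delve", "intricate", "realm", "showcasing", "significant"]

def monthly_rate_per_10k(df: pd.DataFrame, word: str) -> pd.Series:
    """Occurrences of `word` per 10,000 abstracts, by submission month."""
    # Whole-word match; inflected forms (e.g., "delves") are ignored for brevity.
    counts = df["abstract"].str.count(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    month = df["created"].dt.to_period("M")
    grouped = counts.groupby(month)
    return grouped.sum() / grouped.size() * 10_000

rates = pd.DataFrame({w: monthly_rate_per_10k(df, w) for w in TARGET_WORDS})
```

Plotting `rates` month by month yields the kind of per-word trend lines the paper reports.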
The authors argue that the decrease in these word frequencies is not primarily due to LLM updates (such as GPT-4o) but to a conscious effort by authors to avoid terms that have become associated with LLM generation, i.e., human users adapting and editing LLM output. The continued rise of less conspicuous words like "significant" indicates that LLM influence persists, potentially in subtler ways.
The paper further explores the challenges in detecting machine-generated text (MGT). Experiments revised arXiv abstracts from 2018-2025 using GPT-4o-mini with two prompts: a simple revision prompt (P1) and a prompt instructing the LLM to avoid specific flagged words (P2). Revision with P1 increased the frequency of certain target words, while the avoidance prompt P2 reduced their frequency but did not eliminate them. Applying a state-of-the-art MGT detector (Binoculars) showed minimal difference in detection scores both between original abstracts from different years and between original and LLM-revised abstracts, even those fully rewritten by an LLM. Detector results also varied with the prompt used, casting doubt on the robustness of current MGT detection in real-world settings where text mixes human and machine writing, or where users actively edit LLM output.
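A hedged sketch of the two revision conditions, assuming the OpenAI chat-completions API; the P1/P2 wording below paraphrases the paper's description rather than quoting its actual prompts, and `FLAGGED_WORDS` is an illustrative subset:

```python
# Two revision conditions: a plain revision prompt (P1) and one that also
# instructs the model to avoid flagged words (P2). Prompt text is paraphrased.
from openai import OpenAI

client = OpenAI()
FLAGGED_WORDS = ["delve", "intricate", "realm", "showcasing"]

P1 = "Revise the following paper abstract to improve its writing:\n\n{abstract}"
P2 = (
    "Revise the following paper abstract to improve its writing. "
    f"Do not use any of these words: {', '.join(FLAGGED_WORDS)}."
    "\n\n{abstract}"  # not an f-string, so {abstract} stays a placeholder
)

def revise(abstract: str, template: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": template.format(abstract=abstract)}],
    )
    return resp.choices[0].message.content

# Detection scoring (assuming the reference Binoculars implementation,
# https://github.com/ahans30/Binoculars; its exact API is an assumption here):
# from binoculars import Binoculars
# bino = Binoculars()
# original_score = bino.compute_score(abstract)
# revised_score = bino.compute_score(revise(abstract, P2))
```

Per the paper's findings, P2 lowers the flagged-word rate without zeroing it, and the detector scores for original and revised abstracts barely differ.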
The authors conclude that human authors are adapting to LLM use, making MGT detection increasingly difficult, especially on a per-text basis. They suggest that, rather than relying on detectors for individual texts, statistical analysis of word frequencies across large corpora, focused on common words whose usage is subtly shifting (such as the observed decrease in "is" and "are"), remains a more viable way to estimate the overall impact of LLMs on academic writing. This coevolution points to a long-term, and potentially less obvious, integration of LLMs into academic practice.
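As an illustration of that corpus-level approach (an assumption-laden sketch, not the paper's exact procedure), one can compare per-abstract rates of common words between a pre-ChatGPT baseline window and a recent window:

```python
# Relative change in usage rate of common words between two windows.
# The windows, word list, and column names are illustrative assumptions.
import re
import pandas as pd

def rate_per_10k(df: pd.DataFrame, word: str) -> float:
    """Occurrences of `word` per 10,000 abstracts in `df`."""
    n = df["abstract"].str.count(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE).sum()
    return n / len(df) * 10_000

baseline = df[df["created"].dt.year <= 2022]  # pre-ChatGPT window
recent = df[df["created"].dt.year == 2024]

for word in ["is", "are", "significant"]:
    before, after = rate_per_10k(baseline, word), rate_per_10k(recent, word)
    print(f"{word}: {before:.0f} -> {after:.0f} per 10k ({(after - before) / before:+.1%})")
```

Because such aggregate shifts persist even when individual authors edit LLM output, they can register influence that per-text detectors miss.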