- The paper applies Fast-DetectGPT, a zero-shot detection method, to identify machine-generated text (MGT) on Reddit.
- It reveals that MGT prevalence peaks at up to 9% in certain communities and is concentrated among a small group of highly active users.
- The study finds that MGT exhibits distinct stylistic traits and achieves engagement comparable to that of human-generated text (HGT), raising important moderation challenges.
Introduction and Motivation
The proliferation of generative AI, particularly large language models (LLMs), has enabled large-scale production of machine-generated text (MGT) across online platforms. This paper presents a data-driven analysis of MGT on Reddit, focusing on its prevalence, temporal and community distribution, stylistic and social characteristics, and engagement dynamics. The study uses a zero-shot, metric-based detection method (Fast-DetectGPT) to analyze over two years of Reddit activity (2022–2024) across 51 subreddits, providing a conservative yet reliable estimate of MGT’s integration into online discourse.
Methodology
Data Collection and Community Taxonomy
The analysis targets 51 subreddits, selected from the top 1,000 by subscriber count and mapped into five functional categories: Information Seeking, Social Support, Discussion, Identity, and ChitChat. The dataset comprises 38M comments and 4M submissions, filtered to include only texts with at least 250 tokens to ensure detection reliability.
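The length filter described above can be sketched as follows. This is a minimal illustration only: the tokenizer is not specified here, so whitespace splitting is used as a stand-in, and the names `count_tokens` and `filter_long_texts` are hypothetical.

```python
MIN_TOKENS = 250  # minimum length stated in the text


def count_tokens(text: str) -> int:
    """Approximate the token count by whitespace splitting.

    Assumption: the real pipeline may use a model tokenizer instead.
    """
    return len(text.split())


def filter_long_texts(texts, min_tokens=MIN_TOKENS):
    """Keep only texts with at least `min_tokens` tokens."""
    return [t for t in texts if count_tokens(t) >= min_tokens]


# Toy input: one short comment and one 300-word comment.
sample = ["short comment", "word " * 300]
kept = filter_long_texts(sample)
```

With a subword tokenizer the counts would generally be higher than with whitespace splitting, so the same cutoff would admit somewhat shorter texts.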
MGT Detection
Detection is framed as a binary classification problem, with Fast-DetectGPT employed for its computational efficiency and strong performance on conversational data. The threshold for MGT classification is set at τ=0.99, prioritizing precision and minimizing false positives. This approach is justified by the lack of large-scale, Reddit-specific labeled data and the prohibitive cost of neural inference at Reddit scale.
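To make the detection step more concrete, the sketch below computes a conditional probability curvature score in the spirit of Fast-DetectGPT's analytic formulation, then applies a threshold. It is a toy illustration under several assumptions: the per-position distributions are passed in as plain lists rather than produced by a language model, the sampling and scoring models are taken to be the same, and the mapping from the raw curvature to the probability-like score compared against τ is not described here, so `is_mgt` simply takes such a score as input. All function names are hypothetical.

```python
import math


def curvature(token_ids, log_p, q):
    """Normalized conditional probability curvature of a token sequence.

    token_ids: observed token id at each position.
    log_p[j][v]: scoring-model log-probability of token v at position j.
    q[j][v]: sampling-model probability of token v at position j.
    """
    ll = mean = var = 0.0
    for j, tok in enumerate(token_ids):
        ll += log_p[j][tok]  # log-likelihood of the observed token
        m = sum(qv * lp for qv, lp in zip(q[j], log_p[j]))  # expected log-prob
        s = sum(qv * lp * lp for qv, lp in zip(q[j], log_p[j]))  # second moment
        mean += m
        var += s - m * m
    return (ll - mean) / math.sqrt(var)


def is_mgt(score: float, tau: float = 0.99) -> bool:
    """Flag a text as MGT when its probability-like score reaches tau.

    Assumption: whether the comparison is strict is not stated in the text.
    """
    return score >= tau


# Toy vocabulary of three tokens, same distribution at both positions.
probs = [0.7, 0.2, 0.1]
q = [probs, probs]
log_p = [[math.log(p) for p in probs] for _ in range(2)]

likely = curvature([0, 0], log_p, q)    # picks the most probable tokens
unlikely = curvature([2, 2], log_p, q)  # picks the least probable tokens
```

Sequences built from high-probability tokens score above the expected log-probability and get a positive curvature, which is the signal such detectors associate with machine-generated text. A high τ trades recall for precision, consistent with the conservative estimate described above.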
Content and Social Signal Analysis
Textual features (length, readability, compression) are computed, and social intent is quantified using a validated classifier for six social dimensions: knowledge, status, support, fun, conflict, and similarity. The classifier outputs are thresholded and length-normalized to mitigate bias from verbose texts.
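Two of the textual features and the normalization step can be sketched as below. This is only one plausible reading: the compression feature is approximated with a zlib compression ratio, and the social-dimension score is taken to be the fraction of sentences whose classifier output clears a threshold. The 0.5 threshold, the sentence-level granularity, and all function names are assumptions, and the readability metric is omitted because its exact formula is not given here.

```python
import zlib


def text_length(text: str) -> int:
    """Length in whitespace-separated words (tokenization is an assumption)."""
    return len(text.split())


def compression_ratio(text: str) -> float:
    """Compressed size over raw size; lower values mean more compressible."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def normalized_dimension_score(sentence_scores, threshold=0.5):
    """Fraction of sentences whose score for one social dimension
    reaches the threshold, which removes the advantage of longer texts.

    Assumption: threshold value and per-sentence scoring are hypothetical.
    """
    if not sentence_scores:
        return 0.0
    hits = sum(1 for s in sentence_scores if s >= threshold)
    return hits / len(sentence_scores)


repetitive = "thank you so much " * 20
varied = "Quantum gravity puzzles physicists; jazz blew over my kind waxy vest."

ratio_repetitive = compression_ratio(repetitive)
ratio_varied = compression_ratio(varied)
support_score = normalized_dimension_score([0.9, 0.2, 0.7, 0.1])
```

Dividing by the number of sentences means a long comment with two supportive sentences out of twenty scores lower than a short comment with two out of four.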
Engagement Analysis
Engagement is measured as the difference between upvotes and downvotes. For each subreddit-month pair, engagement distributions for MGT and human-generated text (HGT) are compared using bootstrap resampling and the Mann–Whitney U test, with effect sizes estimated via Cliff’s delta.
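The comparison for a single subreddit-month pair can be sketched in plain Python as follows. The engagement values are made-up toy numbers, the function names are hypothetical, and the bootstrap settings (1,000 resamples, 95% percentile interval) are assumptions rather than the study's configuration. The p-value of the Mann–Whitney U test is left out; only the U statistic and the effect size are computed.

```python
import random


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))


def mann_whitney_u(xs, ys):
    """U statistic for xs: count of pairs with x > y, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)


def bootstrap_delta_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Cliff's delta."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        bx = [rng.choice(xs) for _ in xs]  # resample with replacement
        by = [rng.choice(ys) for _ in ys]
        deltas.append(cliffs_delta(bx, by))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy engagement scores (upvotes minus downvotes) for one subreddit-month.
mgt_scores = [5, 8, 12, 3, 9]
hgt_scores = [1, 2, 4, 2, 6, 3]

delta = cliffs_delta(mgt_scores, hgt_scores)
u_stat = mann_whitney_u(mgt_scores, hgt_scores)
ci_low, ci_high = bootstrap_delta_ci(mgt_scores, hgt_scores)
```

The two statistics are linked by delta = 2U/(mn) - 1, where m and n are the group sizes, so a positive delta here means MGT comments tend to score higher than HGT comments. In practice the p-value would usually come from a statistics library such as SciPy's `scipy.stats.mannwhitneyu`.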
Prevalence and Temporal Dynamics of MGT
The study finds that, under conservative detection criteria, MGT is marginally present on Reddit overall but can reach peaks of up to 9% in certain communities and months. Adoption is highly uneven across subreddit categories:
- Information Seeking and Social Support communities exhibit the highest and most consistent MGT adoption, with peaks in r/askscience (6.3%) and r/malefashionadvice (7.7%).
- Discussion communities show minimal MGT presence, rarely exceeding 1.3%.
- Identity communities, notably r/teenagers, display outlier behavior with sustained high MGT adoption (up to 8.5%).
- ChitChat communities have the lowest MGT prevalence.
Temporal analysis reveals that MGT adoption surges coincide with major GenAI releases, particularly the launch of ChatGPT in November 2022 and subsequent LLMs in 2023–2024. After initial spikes, adoption stabilizes at a persistent, though low, baseline.
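The per-community, per-month prevalence figures discussed above amount to a grouped share of flagged texts. A minimal sketch, assuming each text has already been labeled by the detector; the record layout, the function name, and the placeholder subreddit names are hypothetical, and the toy data does not reproduce any reported number.

```python
from collections import defaultdict


def prevalence_by_group(records):
    """Percentage of texts flagged as MGT per (subreddit, month).

    records: iterable of (subreddit, month, is_mgt) tuples.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for subreddit, month, is_mgt in records:
        key = (subreddit, month)
        totals[key] += 1
        flagged[key] += int(is_mgt)
    return {key: 100.0 * flagged[key] / totals[key] for key in totals}


# Toy records with placeholder community names.
records = [
    ("r/example_a", "2023-01", True),
    ("r/example_a", "2023-01", False),
    ("r/example_a", "2023-01", False),
    ("r/example_a", "2023-01", False),
    ("r/example_b", "2023-01", False),
    ("r/example_b", "2023-01", False),
]
prevalence = prevalence_by_group(records)
```

Tracking these percentages month by month is what allows surges to be lined up against model release dates.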
User-Level Adoption Patterns
The fraction of users producing MGT remains low, peaking at 2–3% of active users. Among these users, however, MGT makes up a substantial share of their output, often 10–40%. This indicates that MGT production is concentrated among a small cohort of highly active users.
Figure 2: Average number of users adopting MGT and the corresponding percentage of their comments detected as MGT over time, filtered to exclude one-off users.
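The two user-level quantities, the share of users who adopt MGT and the share of each adopter's comments that are MGT, can be sketched as follows. The rule for excluding one-off users is not spelled out here, so the sketch assumes an adopter needs at least two detected MGT comments; that cutoff, the record layout, and the function name are hypothetical.

```python
from collections import defaultdict


def user_adoption(records, min_mgt_comments=2):
    """Return (fraction of users who are adopters, MGT share per adopter).

    records: iterable of (user, is_mgt) tuples, one per comment.
    min_mgt_comments: assumed cutoff for excluding one-off users.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for user, is_mgt in records:
        totals[user] += 1
        flagged[user] += int(is_mgt)
    adopters = {u for u in totals if flagged[u] >= min_mgt_comments}
    fraction = len(adopters) / len(totals) if totals else 0.0
    shares = {u: flagged[u] / totals[u] for u in adopters}
    return fraction, shares


# Toy comment log: u1 is a repeat adopter, u3 is a one-off.
records = [
    ("u1", True), ("u1", True), ("u1", False), ("u1", False),
    ("u2", False), ("u2", False),
    ("u3", True), ("u3", False),
    ("u4", False),
]
adopter_fraction, mgt_share = user_adoption(records)
```

In this toy log a quarter of users count as adopters, and half of the single adopter's comments are MGT, which mirrors the pattern of few adopters with a high individual MGT share.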
Stylistic and Social Characteristics of MGT
MGT comments are consistently longer, less readable, and more compressible than HGT across all subreddit categories. The difference in readability is especially pronounced in Identity communities, driven by low-quality MGT in r/teenagers.
Social dimension analysis reveals systematic stylistic differences between MGT and HGT.
Engagement Dynamics
Engagement analysis shows that, in most contexts, MGT achieves engagement levels comparable to HGT. Of the 26 subreddit-month pairs with statistically significant differences, MGT outperforms HGT in all but one (r/worldnews, June 2023). The strongest positive effects are observed in Social Support (r/relationship_advice) and Discussion (r/politics) communities, particularly during periods of heightened activity (e.g., election cycles).
Community and Temporal Distribution
MGT prevalence is not uniform across communities or time. Information Seeking and Social Support subreddits are more susceptible to MGT, likely due to their Q&A and advice-oriented structures, which align with LLM capabilities. Identity and ChitChat communities show more sporadic or lower adoption, with notable exceptions among younger demographics.
Implications and Limitations
Practical Implications
The findings underscore the need for platform-level moderation and transparency policies, especially in communities where trust and authenticity are critical. The high engagement elicited by MGT, particularly in advice and political contexts, raises concerns about manipulation and the potential for representational imbalance.
Theoretical Implications
The integration of MGT into Reddit discourse is not merely additive; it introduces new communicative norms, particularly in the expression of social support and status. The co-evolution of human and AI-generated writing styles warrants further investigation, especially as GenAI tools become more deeply embedded in online social ecosystems.
Limitations
- The analysis is restricted to Reddit and may not generalize to platforms with different content formats or community structures.
- The conservative detection threshold likely underestimates true MGT prevalence.
- The study focuses on a subset of active subreddits, which may not capture the full diversity of Reddit activity.
- Engagement comparisons are based on subreddit-month matching and do not control for topic or author popularity.
Future Directions
Future research should extend MGT detection to short-form content, develop more granular matching for engagement analysis, and explore the long-term co-evolution of human and machine-generated discourse. Cross-platform studies are needed to assess the generalizability of these findings and to inform the design of robust detection and governance mechanisms.
Conclusion
This work provides the first large-scale, systematic characterization of MGT on Reddit, revealing that while overall prevalence remains modest, MGT is already an organic component of online discourse in certain communities. MGT exhibits distinct stylistic and social patterns, achieves engagement on par with or exceeding HGT in specific contexts, and is concentrated among a small subset of users. These findings have significant implications for platform governance, the evolution of online communication norms, and the broader societal impact of generative AI.