Machines in the Crowd? Measuring the Footprint of Machine-Generated Text on Reddit

Published 8 Oct 2025 in cs.SI, cs.CL, cs.CY, and physics.soc-ph | (2510.07226v1)

Abstract: Generative Artificial Intelligence is reshaping online communication by enabling large-scale production of Machine-Generated Text (MGT) at low cost. While its presence is rapidly growing across the Web, little is known about how MGT integrates into social media environments. In this paper, we present the first large-scale characterization of MGT on Reddit. Using a state-of-the-art statistical method for detection of MGT, we analyze over two years of activity (2022-2024) across 51 subreddits representative of Reddit's main community types such as information seeking, social support, and discussion. We study the concentration of MGT across communities and over time, and compare MGT to human-authored text in terms of the social signals it expresses and the engagement it receives. Our very conservative estimate of MGT prevalence indicates that synthetic text is marginally present on Reddit, but it can reach peaks of up to 9% in some communities in some months. MGT is unevenly distributed across communities, more prevalent in subreddits focused on technical knowledge and social support, and often concentrated in the activity of a small fraction of users. MGT also conveys distinct social signals of warmth and status giving typical of the language of AI assistants. Despite these stylistic differences, MGT achieves engagement levels comparable to those of human-authored content and in a few cases even higher, suggesting that AI-generated text is becoming an organic component of online social discourse. This work offers the first perspective on the MGT footprint on Reddit, paving the way for new investigations involving platform governance, detection strategies, and community dynamics.

Summary

The paper applies Fast-DetectGPT, a zero-shot detection method, to identify machine-generated text on Reddit.
It reveals that MGT prevalence peaks at up to 9% in certain communities and is concentrated among a small group of highly active users.
The study finds that MGT exhibits distinct stylistic traits and comparable engagement to human text, raising important moderation challenges.

Measuring the Footprint of Machine-Generated Text on Reddit

Introduction and Motivation

The proliferation of generative AI, particularly large language models (LLMs), has enabled large-scale production of machine-generated text (MGT) across online platforms. This paper presents a data-driven analysis of MGT on Reddit, focusing on its prevalence, temporal and community distribution, stylistic and social characteristics, and engagement dynamics. The study uses a zero-shot, metric-based detection method (Fast-DetectGPT) to analyze over two years of Reddit activity (2022–2024) across 51 subreddits, providing a conservative estimate of MGT’s integration into online discourse.

Methodology

Data Collection and Community Taxonomy

The analysis targets 51 subreddits, selected from the top 1,000 by subscriber count and mapped into five functional categories: Information Seeking, Social Support, Discussion, Identity, and ChitChat. The dataset comprises 38M comments and 4M submissions, filtered to include only texts with at least 250 tokens to ensure detection reliability.
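
The length filter described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization; the summary does not say which tokenizer the authors used, and the function name `keep_long_texts` is hypothetical.

```python
def keep_long_texts(texts, min_tokens=250):
    """Keep only texts with at least `min_tokens` tokens.

    Assumption: tokens are whitespace-separated words. A model-specific
    tokenizer would give different counts for the same text.
    """
    return [t for t in texts if len(t.split()) >= min_tokens]
```

A stricter or looser cutoff changes how much of the dataset survives, so the 250-token floor trades coverage of short comments for detection reliability.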

MGT Detection

Detection is framed as a binary classification problem, with Fast-DetectGPT employed for its computational efficiency and strong performance on conversational data. The threshold for MGT classification is set at $\tau=0.99$, prioritizing precision and minimizing false positives. This approach is justified by the lack of large-scale, Reddit-specific labeled data and the prohibitive cost of neural inference at Reddit scale.

Content and Social Signal Analysis

Textual features (length, readability, compression) are computed, and social intent is quantified using a validated classifier for six social dimensions: knowledge, status, support, fun, conflict, and similarity. The classifier outputs are thresholded and length-normalized to mitigate bias from verbose texts.
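
One way to read "thresholded and length-normalized" is sketched below. It assumes the classifier produces a score in [0, 1] per sentence and per dimension, and that normalization means dividing the number of above-threshold sentences by the total sentence count. Both the 0.5 threshold and the per-sentence granularity are assumptions made for illustration, not details confirmed by the summary.

```python
DIMENSIONS = ("knowledge", "status", "support", "fun", "conflict", "similarity")

def social_profile(sentence_scores, threshold=0.5):
    """Fraction of sentences expressing each social dimension.

    sentence_scores: list of dicts, one per sentence, mapping a dimension
    name to a classifier score in [0, 1] (hypothetical input format).
    Dividing by the sentence count keeps long texts from scoring higher
    simply because they contain more sentences.
    """
    n = len(sentence_scores)
    if n == 0:
        return {d: 0.0 for d in DIMENSIONS}
    return {
        d: sum(1 for s in sentence_scores if s.get(d, 0.0) >= threshold) / n
        for d in DIMENSIONS
    }
```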

Engagement Analysis

Engagement is measured as the difference between upvotes and downvotes. For each subreddit-month pair, engagement distributions for MGT and HGT are compared using bootstrap resampling and the Mann–Whitney U test, with effect sizes estimated via Cliff’s delta.
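
The comparison for a single subreddit-month pair can be sketched in plain Python. The code computes the Mann–Whitney U statistic, Cliff's delta, and a percentile bootstrap interval for delta. The resampling scheme, the number of replicates, and the function names are assumptions; the p-value computation is omitted and would normally come from a statistics library such as `scipy.stats.mannwhitneyu`.

```python
import random

def mann_whitney_u(xs, ys):
    """U statistic for xs: count of pairs with x > y, ties counted as half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

def cliffs_delta(xs, ys):
    """P(x > y) - P(x < y) over all cross-group pairs, in [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def bootstrap_delta_interval(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cliff's delta.

    Each replicate resamples both groups with replacement. The pairwise
    comparison is quadratic, so this is only practical for small samples.
    """
    rng = random.Random(seed)
    deltas = sorted(
        cliffs_delta(rng.choices(xs, k=len(xs)), rng.choices(ys, k=len(ys)))
        for _ in range(n_boot)
    )
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The two statistics are linked by delta = 2U/(nm) - 1, where n and m are the group sizes, so a delta near zero corresponds to U near nm/2, that is, no systematic engagement advantage for either group.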

Prevalence and Temporal Dynamics of MGT

The study finds that, under conservative detection criteria, MGT is marginally present on Reddit overall but can reach peaks of up to 9% in certain communities and months. Adoption is highly uneven across subreddit categories:

Information Seeking and Social Support communities exhibit the highest and most consistent MGT adoption, with peaks in r/askscience (6.3%) and r/malefashionadvice (7.7%).
Discussion communities show minimal MGT presence, rarely exceeding 1.3%.
Identity communities, notably r/teenagers, display outlier behavior with sustained high MGT adoption (up to 8.5%).
ChitChat communities have the lowest MGT prevalence.
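
The per-community peaks quoted above are monthly prevalence values: the share of texts in a subreddit and month that the detector flags. A minimal sketch of that bookkeeping follows, assuming each record is a `(subreddit, month, is_mgt)` tuple; the record layout and function names are hypothetical.

```python
from collections import defaultdict

def monthly_prevalence(records):
    """Percent of texts flagged as MGT for each (subreddit, month) pair.

    records: iterable of (subreddit, "YYYY-MM", is_mgt) tuples.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for subreddit, month, is_mgt in records:
        key = (subreddit, month)
        totals[key] += 1
        flagged[key] += int(is_mgt)
    return {key: 100.0 * flagged[key] / totals[key] for key in totals}

def peak_prevalence(prevalence):
    """For each subreddit, the (month, percent) with the highest prevalence."""
    peaks = {}
    for (subreddit, month), pct in prevalence.items():
        if subreddit not in peaks or pct > peaks[subreddit][1]:
            peaks[subreddit] = (month, pct)
    return peaks
```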

Temporal analysis reveals that MGT adoption surges coincide with major GenAI releases, particularly the launch of ChatGPT in November 2022 and subsequent LLMs in 2023–2024. After initial spikes, adoption stabilizes at a persistent, though low, baseline.

User-Level Adoption Patterns

The fraction of users producing MGT remains low, peaking at 2–3% of active users. However, among these users, the proportion of their output that is MGT is substantial, often 10–40%. This indicates that MGT production is concentrated among a small cohort of highly active users.

Figure 2: Average number of users adopting MGT and the corresponding percentage of their comments detected as MGT over time, filtered to exclude one-off users.
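
The two user-level quantities, the share of active users with any detected MGT and the MGT share of each such user's output, can be sketched as below. The code assumes that "one-off users" means users with fewer than `min_comments` comments in the period; the summary does not define the term, and the function name is hypothetical.

```python
from collections import defaultdict

def user_adoption(comments, min_comments=2):
    """Summarize MGT adoption per user.

    comments: iterable of (user, is_mgt) pairs for one time window.
    Returns (share of active users with any MGT,
             {user: fraction of that user's comments flagged as MGT}).
    Users with fewer than `min_comments` comments are excluded (assumption).
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for user, is_mgt in comments:
        totals[user] += 1
        flagged[user] += int(is_mgt)
    active = [u for u, n in totals.items() if n >= min_comments]
    adopters = {u: flagged[u] / totals[u] for u in active if flagged[u] > 0}
    share = len(adopters) / len(active) if active else 0.0
    return share, adopters
```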

Stylistic and Social Characteristics of MGT

MGT comments are consistently longer, less readable, and more compressible than HGT across all subreddit categories. The difference in readability is especially pronounced in Identity communities, driven by low-quality MGT in r/teenagers.
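
Length and compressibility are straightforward to measure. The sketch below uses whitespace token counts and a zlib compression ratio, where a lower ratio means more repetitive, more compressible text. The choice of zlib and of whitespace tokens is an assumption, and readability is left out because the summary does not name the metric used.

```python
import zlib

def text_features(text):
    """Length and compressibility of a text.

    compression_ratio is compressed size over raw size in bytes;
    lower values indicate more compressible (more repetitive) text.
    """
    raw = text.encode("utf-8")
    ratio = len(zlib.compress(raw)) / len(raw) if raw else 0.0
    return {"n_tokens": len(text.split()), "compression_ratio": ratio}
```

Very short texts can produce ratios above 1 because of the compressor's fixed overhead, which is one more reason to compare only texts above a minimum length.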

Social dimension analysis reveals systematic stylistic differences:

MGT conveys higher levels of status and support, reflecting the assistive and polite style of instruction-tuned LLMs.
HGT is stronger in knowledge and similarity, especially in Information Seeking and Social Support communities, suggesting that human-authored content is more grounded in personal experience and community norms.
In Discussion communities, MGT is more knowledge- and status-oriented, while HGT exhibits higher support, conflict, and similarity, aligning with the dynamics of online debate.

Figure 1: Distribution of social dimension scores for MGT and HGT comments across subreddit categories, with effect sizes and significance annotations.

Engagement Dynamics

Engagement analysis demonstrates that, in most contexts, MGT achieves engagement levels comparable to HGT. In 26 subreddit-month pairs with statistically significant differences, MGT outperforms HGT in all but one case (r/worldnews, June 2023). The strongest positive effects are observed in Social Support (r/relationship_advice) and Discussion (r/politics) communities, particularly during periods of heightened activity (e.g., election cycles).

Community and Temporal Distribution

MGT prevalence is not uniform across communities or time. Information Seeking and Social Support subreddits are more susceptible to MGT, likely due to their Q&A and advice-oriented structures, which align with LLM capabilities. Identity and ChitChat communities show more sporadic or lower adoption, with notable exceptions among younger demographics.

Implications and Limitations

Practical Implications

The findings underscore the need for platform-level moderation and transparency policies, especially in communities where trust and authenticity are critical. The high engagement elicited by MGT, particularly in advice and political contexts, raises concerns about manipulation and the potential for representational imbalance.

Theoretical Implications

The integration of MGT into Reddit discourse is not merely additive; it introduces new communicative norms, particularly in the expression of social support and status. The co-evolution of human and AI-generated writing styles warrants further investigation, especially as GenAI tools become more deeply embedded in online social ecosystems.

Limitations

The analysis is restricted to Reddit and may not generalize to platforms with different content formats or community structures.
The conservative detection threshold likely underestimates true MGT prevalence.
The study focuses on a subset of active subreddits, which may not capture the full diversity of Reddit activity.
Engagement comparisons are based on subreddit-month matching and do not control for topic or author popularity.

Future Directions

Future research should extend MGT detection to short-form content, develop more granular matching for engagement analysis, and explore the long-term co-evolution of human and machine-generated discourse. Cross-platform studies are needed to assess the generalizability of these findings and to inform the design of robust detection and governance mechanisms.

Conclusion

This work provides the first large-scale, systematic characterization of MGT on Reddit, revealing that while overall prevalence remains modest, MGT is already an organic component of online discourse in certain communities. MGT exhibits distinct stylistic and social patterns, achieves engagement on par with or exceeding HGT in specific contexts, and is concentrated among a small subset of users. These findings have significant implications for platform governance, the evolution of online communication norms, and the broader societal impact of generative AI.
