- The paper demonstrates a significant increase in AI-generated Wikipedia content following the release of GPT-3.5, using two independent detection tools.
- It employs both a commercial detector (GPTZero) and an open-source one (Binoculars), each calibrated to a 1% false positive rate.
- Findings indicate that AI content often lacks comprehensive referencing and neutrality, raising concerns over quality and bias.
The Rise of AI-Generated Content in Wikipedia
The proliferation of AI-generated content in public information repositories such as Wikipedia represents a significant shift in how public knowledge is produced and curated. This paper by Brooks, Eggert, and Peskoff measures the extent to which AI-generated content is appearing on Wikipedia, leveraging calibrated detection tools to provide a data-driven analysis.
Study Overview
By employing GPTZero, a proprietary detection tool, and Binoculars, an open-source alternative, the authors document a notable increase in AI-generated content on Wikipedia following the release of GPT-3.5. Their analysis of articles created in August 2024 suggests that over 5% of new English articles contain AI-generated text. Similar patterns were observed in other languages, albeit to a lesser extent, underscoring the expansive reach of AI authorship.
The paper operationalizes its analysis using two detection methods:
- GPTZero: A commercial tool evaluating the probability of AI-generated content.
- Binoculars: An open-source detector that scores text using a perplexity-to-cross-perplexity ratio computed with Falcon models, yielding lower-bound estimates of AI-generated text prevalence.
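The Binoculars approach above can be sketched as a ratio of two quantities: the log-perplexity of the text under an observer model, and the cross-perplexity between a performer model's and the observer's next-token distributions. This is a minimal toy sketch of that scoring rule; the function name is illustrative, the inputs are synthetic logits standing in for the Falcon models' actual outputs, and it is not the authors' pipeline.

```python
import numpy as np

def binoculars_score(perf_logits, obs_logits, token_ids):
    """Binoculars-style score: log-perplexity of the observed tokens under
    the observer model, divided by the cross-perplexity between the
    performer's and observer's next-token distributions.
    Lower scores suggest machine-generated text.

    perf_logits, obs_logits: (T, V) arrays of next-token logits.
    token_ids: (T,) array of the tokens actually observed.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_perf = softmax(perf_logits)  # performer's next-token probabilities
    p_obs = softmax(obs_logits)    # observer's next-token probabilities

    # Log-perplexity: mean negative log-likelihood of the actual tokens
    # under the observer model.
    log_ppl = -np.mean(np.log(p_obs[np.arange(len(token_ids)), token_ids]))

    # Cross-perplexity: the observer's expected log-loss when tokens are
    # drawn from the performer's distribution at each position.
    x_ppl = -np.mean(np.sum(p_perf * np.log(p_obs), axis=-1))

    return log_ppl / x_ppl

# Toy usage with random logits in place of real model outputs.
rng = np.random.default_rng(0)
T, V = 12, 50
lp, lo = rng.normal(size=(T, V)), rng.normal(size=(T, V))
ids = rng.integers(0, V, size=T)
score = binoculars_score(lp, lo, ids)
```

Intuitively, text that a language model itself would have generated (e.g., the observer's greedy continuations) scores lower than text the model finds surprising, which is what makes the ratio usable as a detector.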
Both detectors were calibrated to a 1% false positive rate on pre-GPT-3.5 data, when articles could safely be assumed human-written. This makes the detections robust against false alarms, but because false negative rates cannot be estimated, the true prevalence of AI content may be higher than reported.
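Calibrating to a fixed false positive rate amounts to choosing the flagging threshold from the empirical distribution of detector scores on known-human text. This is a small sketch of that idea under stated assumptions: the scores are synthetic stand-ins for real detector outputs on pre-GPT-3.5 articles, lower scores are treated as more AI-like, and the function name is hypothetical.

```python
import numpy as np

def calibrate_threshold(human_scores, target_fpr=0.01):
    """Choose a flagging threshold so that at most target_fpr of
    known-human texts score below it (lower score = more AI-like).
    The 'lower' quantile method keeps the empirical false positive
    rate at or below target_fpr when scores are distinct."""
    return float(np.quantile(human_scores, target_fpr, method="lower"))

# Hypothetical detector scores on pre-GPT-3.5 (human-written) articles.
rng = np.random.default_rng(1)
human_scores = rng.normal(loc=1.0, scale=0.1, size=1000)

threshold = calibrate_threshold(human_scores, target_fpr=0.01)
fpr = (human_scores < threshold).mean()  # empirical false positive rate
```

Any post-GPT-3.5 article scoring below such a threshold would be flagged, with the guarantee that human-written text from the calibration distribution is flagged at most 1% of the time.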
Key Findings
The paper discerns significant qualitative differences between AI-generated and traditional articles. Flagged articles often lack comprehensive referencing, are less integrated into the broader Wikipedia framework (e.g., fewer links to and from other articles), and frequently veer into promotion or one-sided treatment of contentious subjects. Through manual inspection, the authors identify four predominant motivations for AI use: self-promotion, pushing polarized viewpoints, machine translation, and using AI as a writing aid for structured tasks.
Broader Context and Future Implications
The emergence of AI content in Wikipedia highlights broader implications:
- Quality Assurance: AI-generated content often reflects lower quality, raising concerns about Wikipedia’s role as a reliable knowledge source.
- Neutrality and Bias: AI-derived articles can promote biased viewpoints, challenging Wikipedia’s neutrality.
- Model Training Risks: AI-generated content poses risks to LLM training if it feeds back into the datasets used for subsequent models, potentially amplifying biases and inaccuracies.
The implications of AI-generated content stretch beyond Wikipedia:
- Reddit and Press Releases: Initial explorations suggest varying AI usage across domains, minimal in Reddit comments but potentially more substantial in localized UN press releases.
This paper contributes a foundational understanding of AI's encroachment into public information sources. Future scholarship should continue exploring robust detection frameworks, particularly as AI authorship expands in scope and sophistication. As AI continues to redefine content creation landscapes, maintaining the integrity of widely trusted information sources like Wikipedia will be a critical challenge for researchers and practitioners alike.