
Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites (2305.09820v5)

Published 16 May 2023 in cs.CY, cs.LG, and cs.SI

Abstract: As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun using them to generate articles. However, not only can these LLMs produce factually inaccurate articles on reputable websites, but disreputable news sites can also use LLMs to mass-produce misinformation. To begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. To do this, we train a DeBERTa-based synthetic news detector and classify over 15.46 million articles from 3,074 misinformation and mainstream news websites. We find that between January 1, 2022, and May 1, 2023, the relative number of synthetic news articles increased by 57.3% on mainstream websites and by 474% on misinformation sites. We find that this increase is largely driven by smaller, less popular websites. Analyzing the impact of the release of ChatGPT using an interrupted time series analysis, we show that while its release was followed by a marked increase in synthetic articles on small sites as well as misinformation news websites, there was no corresponding increase on large mainstream news websites.
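The two quantities the abstract reports, a relative increase and an interrupted-time-series effect, can be illustrated with a minimal sketch. It assumes a plain segmented linear regression (one least-squares line before the intervention, one after), which may differ from the model the authors actually fit. The function names, the monthly series, and the intervention index are all hypothetical and exist only to make the example run; none of the numbers come from the paper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def interrupted_time_series(series, t0):
    """Segmented regression around an intervention at index t0.

    Fits separate lines to series[:t0] and series[t0:], then reports
    the jump between the two lines at t0 and the change in slope.
    Each segment needs at least two points.
    """
    ts = list(range(len(series)))
    pre_intercept, pre_slope = fit_line(ts[:t0], series[:t0])
    post_intercept, post_slope = fit_line(ts[t0:], series[t0:])
    level_change = (post_intercept + post_slope * t0) - (pre_intercept + pre_slope * t0)
    return {
        "pre_slope": pre_slope,
        "level_change": level_change,
        "slope_change": post_slope - pre_slope,
    }


def relative_increase(start, end):
    """Percentage change from start to end."""
    return 100.0 * (end - start) / start


# Hypothetical monthly share of synthetic articles (percent); made-up values.
# The intervention is placed at index 11 purely for illustration.
synthetic_share = [1.0 + 0.02 * t for t in range(11)]
synthetic_share += [1.7 + 0.12 * (t - 11) for t in range(11, 16)]

effect = interrupted_time_series(synthetic_share, t0=11)
growth = relative_increase(synthetic_share[0], synthetic_share[-1])
print(effect, growth)
```

A positive `level_change` indicates a jump at the intervention beyond what the pre-intervention trend predicts, and a positive `slope_change` indicates faster growth afterwards. A real analysis would also need to account for autocorrelation and seasonality, which this sketch ignores.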
