Language-Agnostic Modeling of Source Reliability on Wikipedia (2410.18803v2)
Abstract: Over the last few years, content verification through reliable sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia. Using editorial activity data, the model evaluates source reliability within articles of varying controversiality, covering topics such as Climate Change, COVID-19, History, Media, and Biology. By crafting features that express domain usage across articles, the model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65, while performance for low-resource languages varies; in all cases, the time the domain remains present in the articles (which we dub permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. This work contributes not only to Wikipedia's efforts to ensure content verifiability but also to ensuring reliability across diverse user-generated content in various language communities.