
Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages (2404.09764v1)

Published 15 Apr 2024 in cs.CY

Abstract: Wikipedia is the largest web repository of free knowledge. Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions. As content quality varies from article to article, editors also spend substantial time rating articles with specific criteria. However, keeping these assessments complete and up-to-date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, we propose a novel computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to model Wikipedia article quality have leveraged machine learning techniques with language-specific features. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a language version-specific normalization criterion. Therefore, we ensure that all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. In addition, we discuss possible downstream tasks to be addressed with these datasets, which are released for public use.
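The three ingredients named in the abstract (language-agnostic structural features, universal weights, and a normalization criterion specific to each language version) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the feature names, the weight values, the capping at 1.0, and the function `quality_score` are all hypothetical and do not come from the abstract.

```python
from typing import Dict

# Hypothetical universal weights, shared by every language edition.
# The feature names and values are illustrative assumptions only.
WEIGHTS: Dict[str, float] = {
    "page_length": 0.4,
    "references": 0.3,
    "sections": 0.2,
    "images": 0.1,
}


def quality_score(
    features: Dict[str, float],
    norms: Dict[str, float],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Weighted sum of structural features, each scaled by a per-language norm.

    `norms` holds one normalization value per feature for a given language
    edition (an assumed design: e.g. a high percentile of that feature in
    that edition). Each scaled feature is capped at 1.0 before weighting.
    """
    total = 0.0
    for name, weight in weights.items():
        norm = norms.get(name, 0.0)
        if norm <= 0:
            # Skip features that have no usable norm for this language.
            continue
        scaled = min(1.0, features.get(name, 0.0) / norm)
        total += weight * scaled
    return total
```

Because only `norms` changes from one language edition to the next, the same weights and the same feature extraction would apply to an edition that has no quality assessment scheme of its own.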

Authors
  1. Paramita Das
  2. Isaac Johnson
  3. Diego Saez-Trumper
  4. Pablo Aragón
