Crowdsourcing Lexical Diversity (2410.23133v1)
Abstract: Lexical-semantic resources (LSRs), such as online lexicons or wordnets, are fundamental for natural language processing applications. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, and the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, in the presence of foreign (Anglo-Saxon) concepts, and in the lack of an explicit indication of untranslatability, also known as cross-lingual lexical gaps, which arise when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing tool facilitates comparisons through microtasks that identify equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.