
Crowdsourcing Lexical Diversity (2410.23133v1)

Published 30 Oct 2024 in cs.CL

Abstract: Lexical-semantic resources (LSRs), such as online lexicons or wordnets, are fundamental for natural language processing applications. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, and the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, in the presence of foreign (Anglo-Saxon) concepts, and in the lack of an explicit indication of untranslatability, also known as cross-lingual lexical gaps, where a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing tool facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
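
The abstract describes microtasks in which several crowd workers judge the same term, which implies some step that combines their answers into one decision. The abstract does not say how LingoGap does this, so the following is only a minimal sketch of one common approach, majority voting with an agreement threshold. The label names, the `aggregate` function, and the `min_agreement` parameter are all assumptions made for illustration and are not taken from the paper or the tool.

```python
from collections import Counter

# Hypothetical label set, loosely mirroring the three outcomes named in the
# abstract (equivalent term, language-specific term, lexical gap).
LABELS = {"equivalent", "language_specific", "gap"}


def aggregate(votes, min_agreement=0.5):
    """Combine worker votes for one microtask by majority vote.

    Returns the most frequent label if its share of the votes is strictly
    greater than `min_agreement`; otherwise returns None, meaning the item
    is unresolved and could be sent to more workers or to an expert.
    """
    if not votes:
        return None
    unknown = set(votes) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Illustrative votes only; these are not data from the paper.
print(aggregate(["gap", "gap", "equivalent"]))
print(aggregate(["gap", "equivalent"]))
```

With the default threshold, an even split between two labels stays unresolved rather than being decided arbitrarily, which is one way to keep low-agreement items out of the resulting lexicon.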

