
Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations (2404.07814v2)

Published 11 Apr 2024 in cs.CL

Abstract: Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification, one in Spanish and one in Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available for automatic lexical simplification in Spanish. Specifically, the Spanish dataset is the first for that language to include scalar ratings of the understanding difficulty of lexical items. In addition, we present a detailed analysis assessing the appropriateness and ethical dimensions of the data for the lexical simplification task.
