
MultiLS: A Multi-task Lexical Simplification Framework (2402.14972v1)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: Lexical Simplification (LS) automatically replaces difficult-to-read words with easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification, with the aim of improving text accessibility for various target demographics, including children, second language learners, and individuals with reading disabilities or low literacy. Several datasets exist for LS, but each specializes in one or two sub-tasks within the LS pipeline; to date, no single LS dataset covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset to be created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all three LS sub-tasks for Portuguese: (1) lexical complexity prediction (LCP), (2) substitute generation, and (3) substitute ranking. Model performances are reported, ranging from transformer-based models to more recent LLMs.
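To make the three sub-tasks named in the abstract concrete, the following is a minimal toy sketch of how they might chain together. It is not the authors' method or any MultiLS model: the frequency table, the synonym dictionary, the threshold, and all function names are hypothetical, and a frequency heuristic stands in for the trained models that the paper evaluates. English words are used purely for readability.

```python
from typing import Dict, List

# Hypothetical toy resources for illustration only; not taken from MultiLS-PT.
WORD_FREQ: Dict[str, int] = {
    "the": 1000, "drug": 120, "will": 600, "ameliorate": 1,
    "improve": 200, "better": 150, "symptoms": 90,
}
SYNONYMS: Dict[str, List[str]] = {"ameliorate": ["better", "improve"]}


def predict_complexity(word: str) -> float:
    """Sub-task 1 (LCP) stand-in: rarer words score closer to 1.0."""
    freq = WORD_FREQ.get(word.lower(), 0)
    return 1.0 / (1.0 + freq)


def generate_substitutes(word: str) -> List[str]:
    """Sub-task 2 stand-in: look up candidate replacements in a dictionary."""
    return list(SYNONYMS.get(word.lower(), []))


def rank_substitutes(candidates: List[str]) -> List[str]:
    """Sub-task 3 stand-in: order candidates from simplest to most complex."""
    return sorted(candidates, key=predict_complexity)


def simplify(sentence: str, threshold: float = 0.05) -> str:
    """Replace each word judged complex with its simplest candidate, if any."""
    output = []
    for token in sentence.split():
        if predict_complexity(token) > threshold:
            ranked = rank_substitutes(generate_substitutes(token))
            # Only substitute when the best candidate is actually simpler.
            if ranked and predict_complexity(ranked[0]) < predict_complexity(token):
                token = ranked[0]
        output.append(token)
    return " ".join(output)


if __name__ == "__main__":
    print(simplify("the drug will ameliorate the symptoms"))
```

In this sketch each stage can be swapped out independently, which mirrors why a dataset annotated for all three sub-tasks is useful: the same instances can train and evaluate every stage of the pipeline.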

Authors (4)
  1. Kai North
  2. Tharindu Ranasinghe
  3. Matthew Shardlow
  4. Marcos Zampieri