Multilingual Controllable Transformer-Based Lexical Simplification (2307.02120v1)
Abstract: Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Suggesting simpler alternatives for complex words without compromising meaning therefore helps convey information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fine-tuned on the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. Evaluation results on three well-known LS datasets -- LexMTurk, BenchLS, and NNSEval -- show that our model outperforms previous state-of-the-art models such as LSBert and ConLS. Further evaluation on part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Our model also obtains performance gains for Spanish and Portuguese.
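The abstract describes three input-side ingredients: a language-specific prefix, control tokens, and candidate substitutions extracted from a masked language model. As a rough illustration of how such an input sequence for a T5-style model could be assembled, here is a minimal sketch; the prefix wording, control-token names, and marker tokens (`[T]`, `[CANDS]`) are assumptions for illustration, not the exact format used by mTLS.

```python
# Illustrative sketch only: marker tokens and control names are assumed,
# not taken from the paper's actual implementation.
def build_ls_input(lang_prefix, sentence, complex_word, candidates, controls):
    """Assemble a T5-style input: language prefix, control tokens,
    the sentence with the complex word marked, and candidate
    substitutions (e.g. from a masked LM) appended at the end."""
    ctrl = " ".join(f"<{name}_{val}>" for name, val in controls.items())
    marked = sentence.replace(complex_word, f"[T] {complex_word} [/T]", 1)
    cands = " ".join(candidates)
    return f"{lang_prefix}: {ctrl} {marked} [CANDS] {cands} [/CANDS]"

example = build_ls_input(
    "simplify en",
    "The law will be promulgated next month.",
    "promulgated",
    ["announced", "published", "enacted"],  # e.g. top masked-LM predictions
    {"WordRank": 0.8, "WordLen": 0.6},      # hypothetical control values
)
print(example)
```

A model fine-tuned on such sequences can then be steered at inference time by varying the control values, which is the mechanism controllable simplification systems like ConLS rely on.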
- 2019. Optuna: A next-generation hyperparameter optimization framework. In A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis, editors, Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 2623–2631. ACM.
- 2021a. Exploration of Spanish Word Embeddings for Lexical Simplification. In H. Saggion, S. Stajner, D. Ferrés, and K. C. Sheang, editors, Proceedings of the First Workshop on Current Trends in Text Simplification (CTTS 2021) Co-Located with the 37th Conference of the Spanish Society for Natural Language Processing (SEPLN2021), Online (Initially Located in Málaga, Spain), September 21st, 2021, volume 2944 of CEUR Workshop Proceedings. CEUR-WS.org.
- 2021b. Lexical Simplification System to Improve Web Accessibility. IEEE Access, 9:58755–58767.
- Aleksandrova, D. and O. Brochu Dufour. 2022. RCML at TSAR-2022 shared task: Lexical simplification with modular substitution candidate ranking. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 259–263, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- Aumiller, D. and M. Gertz. 2022. UniHD at TSAR-2022 shared task: Is compute all we need for lexical simplification? In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 251–258, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- Barzilay, R. and M. Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 141–148, Ann Arbor, Michigan. Association for Computational Linguistics.
- 2011. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 496–501, Portland, Oregon, USA, June. Association for Computational Linguistics.
- 2022. PolyU-CBS at TSAR-2022 shared task: A simple, rank-based method for complex word substitution in two steps. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 225–230, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- 2010. Text Simplification for Children. In Proceedings of the SIGIR Workshop on Accessible Search Systems, pages 19–26, Genève.
- 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- 2017. An adaptable lexical simplification architecture for major Ibero-Romance languages. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 40–47, Copenhagen, Denmark. Association for Computational Linguistics.
- 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.
- Glavaš, G. and S. Štajner. 2015. Simplifying lexical simplification: Do we need simplified corpora? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 63–68, Beijing, China. Association for Computational Linguistics.
- Gooding, S. and E. Kochmar. 2019. Recursive context-aware lexical simplification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4853–4863, Hong Kong, China. Association for Computational Linguistics.
- 2014. Learning a lexical simplifier using Wikipedia. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 458–463, Baltimore, Maryland. Association for Computational Linguistics.
- 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- 2022. MANTIS at TSAR-2022 shared task: Improved unsupervised lexical simplification with pretrained encoders. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 243–250, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- 2021. Controllable text simplification with explicit paraphrasing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3536–3553, Online. Association for Computational Linguistics.
- 2020. Controllable sentence simplification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4689–4698, Marseille, France. European Language Resources Association.
- 2022. MUSS: Multilingual unsupervised sentence simplification by mining paraphrases. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1651–1664, Marseille, France. European Language Resources Association.
- 2019. Lexical simplification approach to support the accessibility guidelines. In Proceedings of the XX International Conference on Human Computer Interaction, pages 1–4, Donostia-Gipuzkoa, Spain. ACM.
- Nikita, N. and P. Rajpoot. 2022. teamPN at TSAR-2022 shared task: Lexical simplification using multi-level and modular approach. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 239–242, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- 2022. GMU-WLV at TSAR-2022 shared task: Evaluating lexical simplification models. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 264–270, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- Paetzold, G. H. and L. Specia. 2016. BenchLS: A Reliable Dataset for Lexical Simplification. Zenodo.
- Paetzold, G. H. and L. Specia. 2017. A Survey on Lexical Simplification. Journal of Artificial Intelligence Research, 60:549–593.
- 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- 2020. LSBert: Lexical Simplification Based on BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3064–3076.
- 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67.
- Reimers, N. and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Reimers, N. and I. Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, November. Association for Computational Linguistics.
- Saggion, H. 2017. Automatic Text Simplification, volume 10 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
- 2022. Findings of the TSAR-2022 shared task on multilingual lexical simplification. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 271–283, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- 2022. CILS at TSAR-2022 shared task: Investigating the applicability of lexical substitution methods for lexical simplification. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 207–212, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- Shardlow, M. 2014. A Survey of Automated Text Simplification. International Journal of Advanced Computer Science and Applications, 4(1).
- 2020. CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data. In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), pages 57–62, Marseille, France. European Language Resources Association.
- 2022. Controllable lexical simplification for English. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 199–206, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- Sheang, K. C. and H. Saggion. 2021. Controllable sentence simplification with a unified text-to-text transfer transformer. In Proceedings of the 14th International Conference on Natural Language Generation, pages 341–352, Aberdeen, Scotland, UK. Association for Computational Linguistics.
- 2023. LLaMA: Open and Efficient Foundation Language Models.
- 2022. UoM&MMU at TSAR-2022 shared task: Prompt learning for lexical simplification. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 218–224, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- 2022. PresiUniv at TSAR-2022 shared task: Generation and ranking of simplification substitutes of complex words in multiple languages. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 213–217, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- 2022. CENTAL at TSAR-2022 shared task: How does context impact BERT-Generated substitutions for lexical simplification? In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 231–238, Abu Dhabi, United Arab Emirates (Virtual), December. Association for Computational Linguistics.
- 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Authors: Kim Cheng Sheang and Horacio Saggion