mEdIT: Multilingual Text Editing via Instruction Tuning (2402.16472v2)
Abstract: We introduce mEdIT, a multilingual extension of CoEdIT, the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multilingual, large pre-trained language models (LLMs) via instruction tuning. They are designed to take natural-language instructions from the user specifying the attributes of the desired text, such as Grammatik korrigieren (German: 'correct the grammar') or Parafrasee la oración (Spanish: 'paraphrase the sentence'). We build mEdIT by curating data from multiple publicly available, human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance against other multilingual LLMs on many multilingual text editing benchmarks. We also find that mEdIT generalizes more effectively to new languages than the multilingual baselines. We publicly release our data, code, and trained models at https://github.com/vipulraheja/medit.
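Since the abstract describes models that take a natural-language instruction together with the text to edit, the sketch below shows how such an instruction-tuned editing checkpoint might be invoked. It is a minimal illustration only: the model identifier and the "instruction: text" prompt format are assumptions, and it further assumes a Hugging Face-hosted seq2seq checkpoint; consult the linked repository for the released models and their actual prompt templates.

```python
# Minimal usage sketch for an instruction-tuned text-editing model.
# NOTE: "vipulraheja/medit-xl" is a hypothetical checkpoint name used for
# illustration; see https://github.com/vipulraheja/medit for the real ones.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "vipulraheja/medit-xl"  # assumption, not a confirmed model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# A natural-language editing instruction (here in German) prepended to the
# input sentence, mirroring the usage described in the abstract.
prompt = "Grammatik korrigieren: Das ist ein sehr schön Haus."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For decoder-only releases, the analogous `AutoModelForCausalLM` loading path would apply instead.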
- Advancements in Arabic grammatical error detection and correction: An empirical investigation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6430–6448, Singapore. Association for Computational Linguistics.
- Towards building Arabic paraphrasing benchmark. In Proceedings of the Second International Conference on Data Science, E-Learning and Information Systems, DATA ’19, New York, NY, USA. Association for Computing Machinery.
- ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679, Online. Association for Computational Linguistics.
- EASSE: Easier automatic sentence simplification evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 49–54, Hong Kong, China. Association for Computational Linguistics.
- Language models for German text simplification: Overcoming parallel data scarcity through style-specific pre-training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1147–1158, Toronto, Canada. Association for Computational Linguistics.
- Anthony Baez and Horacio Saggion. 2023. LSLlama: Fine-tuned LLaMA for lexical simplification. In Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability, pages 102–108, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
- Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.
- Olá, bonjour, salve! XFORMAL: A benchmark for multilingual formality style transfer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3199–3216, Online. Association for Computational Linguistics.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 52–75, Florence, Italy. Association for Computational Linguistics.
- Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805, Vancouver, Canada. Association for Computational Linguistics.
- SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
- Novelty controlled paraphrase generation with retrieval augmented conditional prompt tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10535–10544.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Steven Coyne and Keisuke Sakaguchi. 2023. An analysis of GPT-3's performance in grammatical error correction. arXiv preprint arXiv:2303.14342.
- Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv preprint arXiv:2304.08177.
- Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montréal, Canada. Association for Computational Linguistics.
- Developing NLP tools with a new corpus of learner Spanish. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 7238–7243, Marseille, France. European Language Resources Association.
- Understanding gender bias in knowledge base embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1395, Dublin, Ireland. Association for Computational Linguistics.
- Improving grammatical error correction with multimodal feature integration. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9328–9344, Toronto, Canada. Association for Computational Linguistics.
- Is ChatGPT a highly fluent grammatical error correction system? A comprehensive evaluation.
- Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
- Data strategies for low-resource grammatical error correction. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 117–122, Online. Association for Computational Linguistics.
- Nizar Habash and David Palfreyman. 2022. ZAEBUC: An annotated Arabic-English bilingual writer corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 79–88, Marseille, France. European Language Resources Association.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Neural CRF model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7943–7960, Online. Association for Computational Linguistics.
- Akihiro Katsuta and Kazuhide Yamamoto. 2018. Crowdsourced corpus of sentence simplification with core vocabulary. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Yova Kementchedjhieva and Anders Søgaard. 2023. Grammatical error correction through round-trip machine translation. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2208–2215, Dubrovnik, Croatia. Association for Computational Linguistics.
- Improving iterative text revision by learning where to edit from other revision tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9986–9999, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Construction of an evaluation corpus for grammatical error correction for learners of Japanese as a second language. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 204–211, Marseille, France. European Language Resources Association.
- Few-shot controllable style transfer for low-resource multilingual settings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7439–7468, Dublin, Ireland. Association for Computational Linguistics.
- Vivek Kulkarni and Vipul Raheja. 2023. Writing assistants should model social factors of language.
- Beyond the chat: Executable and verifiable text-editing with LLMs. arXiv preprint arXiv:2309.15337.
- Multilingual pre-training with language and task adaptation for multilingual text style transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 262–271, Dublin, Ireland. Association for Computational Linguistics.
- ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.
- Bactrian-X: A multilingual replicable instruction-following model with low-rank adaptation.
- Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Prompt-based editing for text style transfer. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5740–5750.
- EdiT5: Semi-autoregressive text editing with t5 warm-start. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2126–2138, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Zero-shot crosslingual sentence simplification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5109–5126, Online. Association for Computational Linguistics.
- MUSS: Multilingual unsupervised sentence simplification by mining paraphrases. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1651–1664, Marseille, France. European Language Resources Association.
- Takumi Maruyama and Kazuhide Yamamoto. 2018. Simplified corpus with core vocabulary. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 147–155, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.
- The first QALB shared task on automatic text correction for Arabic. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 39–47, Doha, Qatar. Association for Computational Linguistics.
- Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.
- Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China. Association for Computational Linguistics.
- GLEU without tuning.
- JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229–234, Valencia, Spain. Association for Computational Linguistics.
- Unsupervised paraphrasing with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5136–5150, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- OpenAI. 2023. GPT-4 Technical Report.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- CoEdIT: Text editing by task-specific instruction tuning. arXiv preprint arXiv:2305.09857.
- A recipe for arbitrary text style transfer with large language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 837–848, Dublin, Ireland. Association for Computational Linguistics.
- Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525, Online. Association for Computational Linguistics.
- A simple recipe for multilingual grammatical error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 702–707, Online. Association for Computational Linguistics.
- The second QALB shared task on automatic text correction for Arabic. In Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 26–35, Beijing, China. Association for Computational Linguistics.
- Revisiting non-English text simplification: A unified multilingual benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4898–4927, Toronto, Canada. Association for Computational Linguistics.
- Findings of the TSAR-2022 shared task on multilingual lexical simplification. In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 271–283, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational Linguistics.
- Multitask prompted training enables zero-shot task generalization. In ICLR 2022-Tenth International Conference on Learning Representations.
- PEER: A collaborative language model. In The Eleventh International Conference on Learning Representations.
- NSURL-2019 task 8: Semantic question similarity in Arabic. In Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers, pages 1–8, Trento, Italy. Association for Computational Linguistics.
- Subjective text complexity assessment for German. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 707–714, Marseille, France. European Language Resources Association.
- A unified strategy for multilingual grammatical error correction with pre-trained cross-lingual language model. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 4367–4374. International Joint Conferences on Artificial Intelligence Organization. Main Track.
- Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202, Jeju Island, Korea. Association for Computational Linguistics.
- Yasuhito Tanaka. 2001. Compilation of a multilingual parallel corpus. Proceedings of PACLING 2001, pages 265–268.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- PolyLM: An open source polyglot large language model.
- Sam Witteveen and Martin Andrews. 2019. Paraphrasing with large language models. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 215–220, Hong Kong. Association for Computational Linguistics.
- BigScience Workshop. 2023. BLOOM: A 176B-parameter open-access multilingual language model.
- ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark.
- Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
- Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
- mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
- A new dataset and empirical study for sentence simplification in Chinese. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8306–8321, Toronto, Canada. Association for Computational Linguistics.
- BigTranslate: Augmenting large language models with multilingual translation capability over 100 languages. arXiv preprint arXiv:2305.18098.
- Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online. Association for Computational Linguistics.
- PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Hong Kong, China. Association for Computational Linguistics.
- Towards standardizing Korean grammatical error correction: Datasets and annotation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6713–6742, Toronto, Canada. Association for Computational Linguistics.
- Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 584–594, Copenhagen, Denmark. Association for Computational Linguistics.
- Bidirectional transformer reranker for grammatical error correction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3801–3825, Toronto, Canada. Association for Computational Linguistics.
- PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.
- Overview of the NLPCC 2018 shared task: Grammatical error correction. In Natural Language Processing and Chinese Computing.
- Extrapolating large language models to non-English by aligning languages.
- Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, page 1097–1100, New York, NY, USA. Association for Computing Machinery.