Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models (2306.08756v1)

Published 14 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages; however, training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match the task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.
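
As a rough illustration of the second recipe (warm-starting seq2seq training from a pre-trained encoder and unfreezing it partway through), the sketch below uses plain PyTorch. It is not the paper's implementation: the model sizes, the freeze/unfreeze schedule (freeze_steps), and the toy random batches are illustrative assumptions, and the actual setup would use the paper's denoising seq2seq objective and multilingual pre-training data.

```python
# Minimal sketch of the two-stage recipe: a seq2seq model is warm-started
# from an MLM-pre-trained encoder, trained with the encoder frozen, and the
# encoder is unfrozen partway through training. Sizes and schedules are
# illustrative, not the paper's settings.
import torch
import torch.nn as nn

VOCAB, D_MODEL, PAD = 32000, 512, 0

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, src):
        return self.layers(self.embed(src))

class Seq2Seq(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # warm-started from the pre-trained MLM encoder
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src, tgt):
        memory = self.encoder(src)
        # Causal mask so each target position attends only to earlier positions.
        t = tgt.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(tgt), memory, tgt_mask=causal)
        return self.lm_head(hidden)

def set_encoder_trainable(model, trainable):
    for p in model.encoder.parameters():
        p.requires_grad = trainable

# Stage 2: warm-start the seq2seq model from an already pre-trained encoder.
mlm_encoder = Encoder()          # assume MLM pre-training has already been run
model = Seq2Seq(mlm_encoder)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

total_steps, freeze_steps = 1000, 400          # illustrative schedule
set_encoder_trainable(model, False)            # phase A: decoder-only updates
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

for step in range(total_steps):
    if step == freeze_steps:                   # phase B: unfreeze the encoder
        set_encoder_trainable(model, True)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Toy batch; real training would use a denoising seq2seq objective.
    src = torch.randint(1, VOCAB, (8, 32))
    tgt = torch.randint(1, VOCAB, (8, 16))
    logits = model(src, tgt[:, :-1])
    loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key design choice shown here is rebuilding the optimizer at the unfreeze step, so that the encoder parameters, excluded while frozen, start receiving updates for the remainder of training.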
