
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model (2208.01448v2)

Published 2 Aug 2022 in cs.CL and cs.LG

Abstract: In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.

Overview of AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

The paper presents AlexaTM 20B, a large-scale multilingual sequence-to-sequence (seq2seq) model developed for few-shot learning. It illustrates the advantages of seq2seq architectures, particularly in multilingual settings, over the traditionally favored decoder-only models in language tasks. Through extensive evaluation, the authors highlight AlexaTM 20B's ability to achieve state-of-the-art (SOTA) performance on various tasks, including summarization and machine translation, outperforming significantly larger models such as PaLM 540B and GPT3 175B.

Model Design and Training

AlexaTM 20B uses a standard transformer architecture modified with pre-layer normalization for added training stability. Pre-training mixes a denoising objective with causal language modeling (CLM), the latter being what equips the model for in-context learning. The model is trained on approximately 1 trillion tokens from the Wikipedia and mC4 datasets spanning 12 languages, including lower-resource ones such as Marathi and Tamil; this diverse corpus underpins its strong multilingual performance.
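
To make the pre-layer-normalization detail concrete, below is a minimal PyTorch sketch of a pre-LN encoder block together with a toy sampler that mixes denoising and CLM examples. The layer sizes, class names, and the 80/20 mixing ratio are illustrative assumptions, not the paper's published configuration.

```python
# Minimal sketch of a pre-layer-norm transformer encoder block (PyTorch).
# Sizes and the denoising/CLM mixing ratio are assumptions for illustration.
import random
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    """Applies LayerNorm *before* attention and the feed-forward network,
    which tends to stabilize training of large models."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)                              # pre-norm before self-attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))                # pre-norm before feed-forward
        return x

def sample_pretraining_task(denoise_prob: float = 0.8) -> str:
    """Mix denoising and CLM-style examples during pre-training.
    The 80/20 split here is a placeholder, not the published ratio."""
    return "denoising" if random.random() < denoise_prob else "clm"

if __name__ == "__main__":
    block = PreLNEncoderBlock()
    out = block(torch.randn(2, 16, 512))             # (batch, seq_len, d_model)
    print(out.shape, sample_pretraining_task())
```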

Few-Shot Performance

The paper demonstrates the efficacy of AlexaTM 20B in few-shot learning across various language tasks:

  • Summarization: AlexaTM 20B outperforms PaLM 540B in 1-shot summarization, with particularly strong results when the input context is long.
  • Machine Translation: On the Flores-101 benchmark, the model surpasses existing supervised models in 1-shot translation for nearly all examined language pairs, excelling especially when translating into or out of low-resource languages (a sketch of the 1-shot prompting mechanic follows this list).
  • Other Multilingual Tasks: AlexaTM 20B achieves SOTA zero-shot scores on XNLI, XCOPA, Paws-X, and XWinograd, improving on previous models such as XGLM 7.5B.
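
The 1-shot evaluations above amount to concatenating a single demonstration with the query and letting the encoder-decoder generate the answer. The sketch below shows that mechanic with the Hugging Face transformers API; the prompt template is an assumption, and the public flan-t5-small checkpoint is only a stand-in, since AlexaTM 20B itself is not assumed to be loadable this way.

```python
# Sketch of 1-shot prompting for an encoder-decoder model.
# The template and checkpoint are stand-ins, not the paper's exact setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def one_shot_translation_prompt(example_src: str, example_tgt: str, query_src: str) -> str:
    """Concatenate one demonstration pair with the query sentence."""
    return (f"Translate English to German.\n"
            f"English: {example_src}\nGerman: {example_tgt}\n"
            f"English: {query_src}\nGerman:")

prompt = one_shot_translation_prompt(
    "The weather is nice today.", "Das Wetter ist heute schön.",
    "Where is the train station?")

# Stand-in public checkpoint; swap in the actual model if you have access.
tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
inputs = tok(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```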

Challenges and Observations

Despite its advantages, AlexaTM 20B experiences limitations in reasoning tasks compared to larger models like GPT3. The paper notes that scaling, alongside specific prompting strategies, could further enhance seq2seq models.

Additionally, the authors explore issues related to memorization, bias, and fairness:

  • Memorization: The authors report that AlexaTM 20B's architecture reduces memorization as a function of context size, easing some privacy concerns (a generic probing sketch follows this list).
  • Bias and Fairness: The model exhibits biases similar to those of other large language models, warranting task-specific mitigation strategies before deployment.
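
One generic way to probe memorization, under our own assumptions rather than the paper's exact protocol, is to prompt with a training-corpus prefix and measure how much of the generated continuation reproduces the true continuation verbatim:

```python
# Rough memorization probe: longest-common-substring overlap between the
# model's continuation and the ground-truth continuation from training data.
from difflib import SequenceMatcher
from typing import Callable

def memorization_score(generate: Callable[[str], str],
                       prefix: str, true_continuation: str) -> float:
    """Return the fraction of the true continuation reproduced verbatim
    (0 = no overlap, 1 = exact)."""
    generated = generate(prefix)
    match = SequenceMatcher(None, generated, true_continuation).find_longest_match(
        0, len(generated), 0, len(true_continuation))
    return match.size / max(len(true_continuation), 1)

# Toy stand-in "model" that echoes a memorized string, for demonstration only.
canned = {"Once upon a time": " there was a princess who lived in a castle."}
score = memorization_score(lambda p: canned.get(p, ""), "Once upon a time",
                           " there was a princess who lived in a castle.")
print(f"memorization score: {score:.2f}")
```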

Implications and Future Directions

The findings advocate for seq2seq models as viable alternatives for LLM training, emphasizing their efficiency in multilingual settings. The authors suggest that, with sufficient investment in tailored pre-training, machine translation can rely more heavily on monolingual data and less on large parallel corpora. To compete with much larger models such as GPT3 175B across the board, however, further scaling of seq2seq models will be necessary.

Environmental Perspective

Significantly, AlexaTM 20B has a reduced environmental impact compared to larger LLMs, reflecting a more efficient use of computational resources during its training phase.

Ultimately, this paper argues that seq2seq architectures, grounded in carefully chosen pre-training objectives, promise advances across language tasks and deserve a central role in future systems that prioritize scalability and efficiency.

Authors (16)
  1. Saleh Soltan
  2. Shankar Ananthakrishnan
  3. Jack FitzGerald
  4. Rahul Gupta
  5. Wael Hamza
  6. Haidar Khan
  7. Charith Peris
  8. Stephen Rawls
  9. Andy Rosenbaum
  10. Anna Rumshisky
  11. Chandana Satya Prakash
  12. Mukund Sridhar
  13. Fabian Triefenbach
  14. Apurv Verma
  15. Gokhan Tur
  16. Prem Natarajan
Citations (74)