What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? (2204.05832v1)

Published 12 Apr 2022 in cs.CL, cs.LG, and stat.ML

Abstract: Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.

Analysis of LLM Architectures and Pretraining Objectives for Zero-Shot Generalization

The paper "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?" presents a comprehensive investigation into Transformer-based language models to determine optimal configurations for zero-shot generalization. The authors analyze various model architectures and pretraining objectives, providing detailed insights into their performance both with and without multitask prompted finetuning.

Experimental Setup

The research evaluates three model architectures: causal decoder-only, non-causal decoder-only, and encoder-decoder models. Each architecture is tested with two distinct pretraining objectives: autoregressive language modeling and masked language modeling. The authors additionally explore the impact of multitask prompted finetuning on zero-shot generalization.
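
To make the architectural distinction concrete, the short sketch below constructs the token-visibility (attention mask) patterns that separate the three setups: a causal decoder attends only to past positions, a non-causal ("prefix") decoder additionally allows full bidirectional attention over the input prefix, and an encoder-decoder pairs a fully bidirectional encoder with a causally masked decoder that cross-attends to the encoder output. This is an illustrative NumPy sketch with made-up helper names, not the authors' implementation.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Causal decoder: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(prefix_len: int, seq_len: int) -> np.ndarray:
    """Non-causal ("prefix") decoder: bidirectional attention over the
    first `prefix_len` input tokens, causal attention afterwards."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:, :prefix_len] = True  # every position can see the whole input prefix
    return mask

def encoder_decoder_masks(src_len: int, tgt_len: int):
    """Encoder-decoder: fully bidirectional encoder self-attention,
    causal decoder self-attention, unrestricted cross-attention."""
    enc_self = np.ones((src_len, src_len), dtype=bool)
    dec_self = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    cross = np.ones((tgt_len, src_len), dtype=bool)
    return enc_self, dec_self, cross

if __name__ == "__main__":
    print(causal_mask(4).astype(int))        # lower-triangular (incl. diagonal) visibility
    print(prefix_lm_mask(2, 4).astype(int))  # first two columns fully visible
```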

The scale of the models under consideration is significant, with experiments involving models exceeding 5 billion parameters trained for more than 170 billion tokens. Such an extensive scale increases the likelihood that the conclusions will remain relevant as model sizes grow.

Key Findings and Results

After extensive experimental evaluations, several critical observations emerge:

  1. Pretraining Only: Causal decoder-only models trained with a full (autoregressive) language modeling objective exhibit superior zero-shot performance immediately after pretraining. This is consistent with prevailing practice in the field, where autoregressive objectives are the most common choice.
  2. Multitask Finetuning: The landscape shifts after multitask prompted finetuning. Non-causal models pretrained with a masked language modeling objective and then multitask-finetuned show the best performance. In other words, although causal decoder-only models lead after purely unsupervised pretraining, non-causal models pull ahead once multitask finetuning is applied (see the prompt-formatting sketch following this list).
  3. Model Adaptation: The paper also studies efficient adaptation of pretrained models across architectures and objectives. Non-causal decoder models pretrained with masked language modeling can be adapted into performant causal generative models by continuing training with an autoregressive language modeling objective. Conversely, pretrained causal decoder models can be efficiently adapted into non-causal decoders, reaching competitive performance after subsequent multitask finetuning (see the adaptation sketch following this list).
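
To make the "multitask prompted finetuning" of point 2 concrete, the snippet below converts a labeled natural language inference example into a prompted text-to-text pair, in the spirit of the T0-style recipe the paper evaluates. The function name, template wording, and label mapping are illustrative assumptions, not the paper's actual prompt templates.

```python
# Hypothetical prompt formatting for multitask finetuning; the template text
# below is an illustration, not one of the paper's prompt templates.
def format_nli_prompt(premise: str, hypothesis: str, label: int):
    source = (
        f"{premise}\n"
        f'Question: Does the previous passage imply that "{hypothesis}"? '
        "Answer yes, no, or maybe."
    )
    target = ["yes", "maybe", "no"][label]  # entailment / neutral / contradiction
    return source, target

src, tgt = format_nli_prompt(
    premise="A man is playing a guitar on stage.",
    hypothesis="A musician is performing.",
    label=0,
)
print(src)
print("->", tgt)  # -> yes
```

During multitask finetuning, many such (source, target) pairs drawn from a large collection of datasets and templates are mixed together, and the model is trained to generate the target text given the prompted source.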

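The adaptation recipes in point 3 amount to reusing the same decoder weights under a different visibility pattern and training objective. The toy PyTorch sketch below (with a made-up TinyDecoder module) is not the authors' codebase and omits the span-corruption details of the actual masked language modeling objective; it only shows the core mechanics: a non-causal phase using a prefix-visible mask, followed by an adaptation phase that swaps in a causal mask and resumes training with an ordinary next-token prediction loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """A deliberately tiny decoder-only LM, used only to illustrate that
    'causal' vs. 'non-causal' is just a change of attention mask."""
    def __init__(self, vocab: int = 100, d: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, tokens, attn_mask):
        # attn_mask: (seq, seq) boolean, True = attention blocked (PyTorch convention)
        x = self.embed(tokens)
        h, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        h = h + self.ff(h)
        return self.lm_head(h)

def causal_block_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal: future positions are blocked
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def prefix_block_mask(prefix_len: int, seq_len: int) -> torch.Tensor:
    mask = causal_block_mask(seq_len)
    mask[:, :prefix_len] = False  # the input prefix is visible to every position
    return mask

model = TinyDecoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 100, (2, 16))  # stand-in batch of token ids

# "Non-causal pretraining" stand-in: prefix-visible mask. (In the paper this
# phase uses a masked language modeling objective, which is omitted here.)
_ = model(tokens, prefix_block_mask(prefix_len=8, seq_len=16))

# Adaptation to a causal generator: same weights, causal mask, and an
# ordinary next-token prediction (autoregressive language modeling) loss.
logits = model(tokens, causal_block_mask(16))
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions at positions 0..n-2
    tokens[:, 1:].reshape(-1),                    # targets are the next tokens
)
loss.backward()
opt.step()
```

The reverse direction reported in the paper works analogously: a causally pretrained decoder is further trained, and then multitask-finetuned, with a prefix-visible mask, so the input portion of each example becomes fully bidirectional.
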
Practical and Theoretical Implications

The findings highlight the nuanced interactions between model architecture, pretraining objective, and downstream setups such as multitask finetuning. Practically, this research suggests pathways for creating models that are both efficient and versatile, serving zero-shot and generative use cases with equal competence.

Theoretically, this work underscores the complexity inherent in pretraining LLMs and the potential for leveraging architecture-objective adaptation to improve model performance across diverse settings. It questions simplistic modeling choices and points toward the benefits of more dynamic adaptation strategies.

Future Developments

The implications of this research suggest several avenues for future work. These include further refining adaptation techniques to bridge the gap between architectures and objectives efficiently. Additionally, exploring new architectural innovations or pretraining paradigms that can inherently unify the advantages seen in task-specific configurations could be valuable.

In summary, this paper's methodical examination of model architectures and pretraining objectives informs both present practices and future directions in the pursuit of advanced LLMs capable of comprehensive zero-shot generalization.

Authors (8)
  1. Thomas Wang (17 papers)
  2. Adam Roberts (46 papers)
  3. Daniel Hesslow (12 papers)
  4. Teven Le Scao (18 papers)
  5. Hyung Won Chung (30 papers)
  6. Iz Beltagy (39 papers)
  7. Julien Launay (17 papers)
  8. Colin Raffel (83 papers)
Citations (147)