Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
The paper "Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?" by Yi Tay et al. contributes a systematic study of the scaling behavior of various model architectures in NLP. It extends existing research on the empirical scaling properties of models such as Transformers by investigating how different inductive biases and architectures respond to scaling. The study spans multiple compute regions and model scales, from 15 million to 40 billion parameters.
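To make "scaling behavior" concrete, here is a minimal sketch of fitting a power law of the form L(C) ≈ a · C^(-b) to (compute, loss) points for two hypothetical architectures. The data points and architecture labels are invented for illustration; they are not measurements from the paper.

```python
import numpy as np

# Illustrative (training compute, upstream loss) points for two architectures.
# These numbers are made up for demonstration, not results from the paper.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])          # training FLOPs
loss = {
    "vanilla_transformer": np.array([3.10, 2.85, 2.62, 2.44, 2.28]),
    "alt_architecture":    np.array([3.05, 2.90, 2.78, 2.70, 2.66]),
}

def fit_power_law(c, l):
    """Fit L(C) ~ a * C^(-b) by linear regression in log-log space.
    Returns (a, b); a larger b means loss falls faster as compute grows."""
    slope, intercept = np.polyfit(np.log(c), np.log(l), deg=1)
    return float(np.exp(intercept)), float(-slope)

for name, l in loss.items():
    a, b = fit_power_law(compute, l)
    print(f"{name}: L(C) ~ {a:.2f} * C^(-{b:.3f})")
```

Under this simple fit, the architecture with the larger exponent b improves faster with additional compute, which is the sense in which the paper compares how well different architectures "scale".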
Key Contributions and Findings
- Divergence in Scaling Laws Across Models:
- The research compares ten diverse model architectures, including several Transformer variants (e.g., Universal Transformers, Switch Transformers), efficient models (e.g., Performer, Funnel Transformer), convolution-based models (e.g., Dynamic Convolutions, Lightweight Convolutions), and recently proposed architectures (e.g., MLP-Mixers).
- A central finding is that different model architectures exhibit markedly different scaling behaviors. Notably, the vanilla Transformer displays robust scaling properties, whereas alternative architectures such as Dynamic Convolutions and MLP-Mixers do not scale as efficiently with increased compute.
- Influence of Inductive Bias on Scaling:
- The paper underscores that the inductive bias inherent in a model architecture significantly impacts its scaling behavior. For instance, models such as Universal Transformers and ALBERT, which share parameters across layers, exhibit different scaling dynamics than standard Transformers (a minimal sketch of this kind of weight sharing follows this list).
- Difference in Scaling Between Upstream and Downstream Tasks:
- The paper observes that models performing well on upstream tasks (pre-training negative log-perplexity) do not necessarily excel on downstream transfer tasks. This underscores the importance of evaluating models on both pre-training quality and fine-tuning effectiveness.
- Variability Across Compute Regions:
- The research shows that the best-performing model is not constant across compute scales. For example, the Evolved Transformer outperforms the vanilla Transformer at smaller compute scales but not at larger ones. This variability underscores the need to evaluate model performance across a range of scales.
- Preliminary Guidelines for Practitioners:
- The research provides practical advice for researchers and practitioners in model design and evaluation. It emphasizes the risks associated with staking large computational resources on architectures with unproven scaling properties and suggests that models with minimal changes to the standard Transformer architecture are more likely to scale predictably.
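As a concrete illustration of the parameter-sharing inductive bias mentioned above, here is a minimal PyTorch sketch in which a single Transformer block is reused across depth, in the spirit of ALBERT and the Universal Transformer, versus a stack of independent blocks. The class and argument names are invented for this summary and do not reflect the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Toy encoder illustrating cross-layer parameter sharing: one block is
    applied `depth` times, so parameter count stays flat while compute grows
    with depth. Purely illustrative, not the architecture used in the paper."""

    def __init__(self, d_model=256, nhead=4, depth=6, share_parameters=True):
        super().__init__()
        if share_parameters:
            # A single layer reused at every step of the recurrence over depth.
            block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.blocks = nn.ModuleList([block] * depth)  # same object, shared weights
        else:
            # Standard Transformer: independent weights per layer.
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                for _ in range(depth)
            )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

x = torch.randn(2, 16, 256)                          # (batch, sequence, d_model)
shared = SharedLayerEncoder(share_parameters=True)
unshared = SharedLayerEncoder(share_parameters=False)
print(sum(p.numel() for p in shared.parameters()),    # far fewer parameters...
      sum(p.numel() for p in unshared.parameters()))  # ...for the same compute
```

The shared variant has roughly 1/depth as many parameters for the same amount of compute per forward pass, which changes how parameter count and compute trade off and hence why such models trace out different scaling curves than a standard Transformer.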
Experimental Setup and Methodology
- Models Evaluated:
- The paper considers architectures such as vanilla Transformers, Evolved and Universal Transformers, Switch Transformers, Performer, Funnel Transformer, ALBERT, Dynamic Convolutions, Lightweight Convolutions, and MLP-Mixers.
- The experiments implement these models within a sequence-to-sequence framework following the conventions of T5, scaling each architecture uniformly across configurations from tiny to XL, spanning roughly 15 million to 40 billion parameters (see the parameter-count sketch after this list).
- Comprehensive Evaluation Metrics:
- The paper evaluates models on both upstream metrics (e.g., negative log-perplexity) and downstream metrics (e.g., accuracy on GLUE, SuperGLUE, and SQuAD tasks), providing a holistic view of model performance across pre-training and fine-tuning stages.
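To make the upstream metric concrete: negative log-perplexity is the average natural-log probability the model assigns to each target token, so higher (closer to zero) is better. The sketch below uses invented numbers purely for illustration.

```python
import math

def negative_log_perplexity(token_log_probs):
    """Average per-token log-likelihood (negative log-perplexity).
    `token_log_probs` holds natural-log probabilities the model assigned
    to each target token; the example values below are illustrative."""
    return sum(token_log_probs) / len(token_log_probs)

# Example: four target tokens predicted with probabilities 0.5, 0.25, 0.4, 0.1.
log_probs = [math.log(p) for p in (0.5, 0.25, 0.4, 0.1)]
nlp = negative_log_perplexity(log_probs)
print(f"negative log-perplexity: {nlp:.3f}")             # ~ -1.325
print(f"perplexity:              {math.exp(-nlp):.3f}")  # ~ 3.761
```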
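For a rough sense of how uniform size scaling plays out in a T5-style encoder-decoder, the sketch below estimates parameter counts for an illustrative ladder of configurations. The layer widths and depths are hypothetical stand-ins, not the paper's exact settings.

```python
def seq2seq_param_count(n_layers, d_model, d_ff, vocab_size=32_000):
    """Rough parameter count for a T5-style encoder-decoder Transformer.
    Counts attention projections, feed-forward weights, and a shared
    embedding only; biases, layer norms, and relative-position tables
    are ignored. Configurations here are illustrative, not the paper's."""
    attention = 4 * d_model * d_model              # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                       # up- and down-projection
    encoder_layer = attention + ffn
    decoder_layer = 2 * attention + ffn            # self- plus cross-attention
    embedding = vocab_size * d_model               # shared input/output embedding
    return n_layers * (encoder_layer + decoder_layer) + embedding

# Hypothetical "uniform scaling" ladder in the spirit of the paper's setup.
for name, (layers, d_model, d_ff) in {
    "tiny":  (4,  256,  1024),
    "small": (6,  512,  2048),
    "base":  (12, 768,  3072),
    "large": (24, 1024, 4096),
}.items():
    print(f"{name:>5}: ~{seq2seq_param_count(layers, d_model, d_ff) / 1e6:.0f}M params")
# Prints roughly 16M / 60M / 223M / 737M for these invented configurations.
```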
Implications and Future Directions
The findings presented in this paper have substantial implications for both theoretical understanding and practical application in the domain of large-scale NLP models:
- Model Selection and Resource Allocation:
- The results suggest that researchers should carefully select model architectures based on their scaling properties across different compute regions. This consideration is especially vital for projects with significant resource constraints.
- Design of New Architectures:
- The variability in scaling characteristics among different models indicates a need for more nuanced design and evaluation frameworks when developing new model architectures. Future research should aim at creating models that perform well across varying computational budgets.
- Highlighting the Complexity of Scaling:
- The paper’s extensive experimental analysis highlights the intricacies associated with scaling NLP models and the necessity for thorough benchmarking at multiple scales.
In conclusion, the paper by Tay et al. makes a significant contribution to understanding how inductive bias influences the scaling of NLP model architectures. It offers valuable insights and practical recommendations for researchers and practitioners, emphasizing comprehensive evaluation across both upstream and downstream tasks and urging caution before adopting novel, untested architectures at large scale. Future work on model scaling should continue to address these challenges with an emphasis on robust, scalable, and efficient architectures.