- The paper shows that most Transformer modifications do not yield significant performance improvements across diverse NLP tasks.
- It evaluates each variant within a unified experimental framework, holding hyperparameters consistent and using both transfer learning and direct supervised training.
- Findings highlight that activation replacements (e.g., SwiGLU, GeGLU) and RMS normalization offer benefits, while parameter sharing typically hinders performance.
Overview of the Study on Transformer Modifications
The paper "Do Transformer Modifications Transfer Across Implementations and Applications?" provides a rigorous exploration of various modifications proposed to the Transformer architecture, a staple in modern NLP. Despite the numerous architectural modifications suggested in the literature since the inception of Transformers, widespread adoption of these modifications remains minimal. The authors investigate this phenomenon by extensively evaluating a range of Transformer variants under a unified experimental framework, encompassing diverse NLP tasks.
Key Results
The primary finding is that the majority of Transformer modifications do not yield significant performance improvements. Notably, the modifications that did improve performance were generally minor changes or originated from the same codebase employed in this paper. All experiments were run with a consistent set of hyperparameters, using both transfer learning and direct supervised tasks to evaluate the effectiveness of each architectural change.
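The sketch below is a schematic of that evaluation protocol, not the authors' code: the point is simply that one fixed hyperparameter set is reused while only the architecture varies. All names and values here are illustrative placeholders.

```python
# Illustrative protocol sketch: hold hyperparameters fixed, vary only the architecture,
# so that score differences can be attributed to the architectural change alone.
SHARED_HPARAMS = {  # hypothetical values, not the paper's
    "d_model": 512,
    "num_layers": 12,
    "learning_rate": 1e-3,
    "batch_size": 128,
}

VARIANTS = ["vanilla", "geglu_ffn", "rmsnorm", "albert_sharing", "mixture_of_softmaxes"]

def evaluate_variant(name: str, hparams: dict) -> float:
    """Placeholder: pre-train and fine-tune (or train directly on a supervised task),
    then return the downstream score for this architectural variant."""
    return 0.0  # stand-in for an actual training and evaluation run

scores = {name: evaluate_variant(name, SHARED_HPARAMS) for name in VARIANTS}
```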
Detailed Examination of Modifications
The authors categorize their investigation into several areas of Transformer modification: activation functions, normalization techniques, model depth, embedding strategies, parameter sharing, and softmax computation. Among these, activation-function replacements such as SwiGLU and GeGLU provided consistent improvements during both pre-training and fine-tuning. Additionally, RMS normalization improved training speed and effectiveness over conventional layer normalization.
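To make the two modifications that did transfer concrete, here is a minimal PyTorch sketch of a GLU-style feed-forward block (GeGLU with `F.gelu`, SwiGLU if you swap in `F.silu`) and of RMS normalization. This is not the paper's implementation; the class names, dimension arguments, and bias-free linear layers are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward block with a gated linear unit in place of a single activation."""

    def __init__(self, d_model: int, d_ff: int, activation=F.gelu):
        super().__init__()
        self.activation = activation  # F.gelu -> GeGLU, F.silu -> SwiGLU
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_value = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Element-wise gate: activation(x W_gate) * (x W_value), then project back to d_model.
        return self.w_out(self.activation(self.w_gate(x)) * self.w_value(x))


class RMSNorm(nn.Module):
    """Root-mean-square layer normalization: rescales activations without centering or a bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.scale
```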
In contrast, parameter sharing strategies, such as those inspired by ALBERT, generally hampered performance. Changes to the softmax layer, notably the mixture of softmaxes, did boost quality, but at a notable cost in computational efficiency. Larger architectures and those developed within the same implementation, such as the Switch Transformer and variants of the Synthesizer, offered some advantage, albeit at higher parameter counts.
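For contrast, the sketch below illustrates, in the same illustrative PyTorch style rather than the paper's code, ALBERT-style parameter sharing (one layer's weights reused at every depth) and a mixture-of-softmaxes output layer, whose extra per-component softmaxes over the vocabulary are the source of the slowdown noted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedLayerStack(nn.Module):
    """ALBERT-style parameter sharing: a single layer's weights are applied at every depth."""

    def __init__(self, layer: nn.Module, n_layers: int):
        super().__init__()
        self.layer = layer
        self.n_layers = n_layers

    def forward(self, x):
        for _ in range(self.n_layers):
            x = self.layer(x)  # same parameters reused, so parameter count stays roughly constant
        return x


class MixtureOfSoftmaxes(nn.Module):
    """Output layer that mixes K softmax distributions instead of computing a single one."""

    def __init__(self, d_model: int, vocab_size: int, n_components: int = 4):
        super().__init__()
        self.n_components = n_components
        self.prior = nn.Linear(d_model, n_components)             # mixture weights per position
        self.latent = nn.Linear(d_model, n_components * d_model)  # one context vector per component
        self.decoder = nn.Linear(d_model, vocab_size)             # shared output projection

    def forward(self, h):
        # h: (batch, d_model) final hidden state.
        pi = F.softmax(self.prior(h), dim=-1)                                   # (batch, K)
        z = torch.tanh(self.latent(h)).view(-1, self.n_components, h.size(-1))  # (batch, K, d_model)
        component_probs = F.softmax(self.decoder(z), dim=-1)                    # (batch, K, vocab)
        # K full softmaxes over the vocabulary per position: the main cost of this variant.
        return (pi.unsqueeze(-1) * component_probs).sum(dim=1)                  # (batch, vocab)
```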
Implications and Conjectures
The fact that most modifications failed to improve performance in this unified setting suggests that such alterations often do not transfer well across implementations or task settings. This underscores the need for evaluation methodologies that test architectural robustness across varying contexts.
Recommendations for Future Research
From a broader perspective, the paper proposes several guidelines for improving the robustness of Transformer modifications. Future architectural advances should be tested in multiple, independent codebases and across a diverse array of task types, potentially including domains beyond NLP such as computer vision. Hyperparameters should be kept consistent across these tests to isolate the true impact of an architectural modification; the ability of a change to perform well under varied settings is itself an indicator of robustness. Furthermore, to build trust in the results and guide best practices, reporting should include statistical measures of variability across multiple experiment runs.
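As a small illustration of that last recommendation, variability across repeated runs can be reported with something as simple as a per-variant mean and sample standard deviation; the function name and the scores below are purely hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation across repeated runs (e.g., different random seeds)."""
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


# Purely hypothetical benchmark scores from five seeds of one variant:
print(summarize_runs([71.3, 70.8, 71.9, 70.5, 71.1]))
```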
The results and methodology outlined in this paper provide a substantial foundation for understanding the impact of Transformer modifications and will help shape future research on improving this widely used neural architecture.