- The paper demonstrates that rearranging sublayers in Transformers can lower perplexity by concentrating self-attention at lower levels and feedforward layers at higher levels.
- The study uses empirical tests on WikiText-103 and other datasets, revealing that the sandwich Transformer often outperforms the traditional interleaved model.
- The research offers practical insights for enhancing training efficiency and encourages further investigation into task-specific optimal sublayer configurations.
The paper "Improving Transformer Models by Reordering their Sublayers" by Ofir Press, Noah A. Smith, and Omer Levy explores a novel approach to enhancing the performance of Transformer models in natural language processing tasks. The focal point of this research is the reordering of sublayers within the Transformer architecture, which traditionally consists of a repeating pattern of self-attention and feedforward layers. This paper investigates whether rearranging these layers could result in improved model performance without increasing computational costs.
Transformers are the cornerstone of recent advances in NLP due to their ability to model long-range dependencies in text effectively. The canonical Transformer architecture alternates self-attention and feedforward sublayers in a strictly interleaved pattern, yet nothing in the architecture's design dictates that this configuration is optimal. The authors challenge this status quo by experimenting with various permutations of these sublayers, testing whether alternative orderings can achieve lower perplexity in language modeling.
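To make the notion of a sublayer ordering concrete, each model can be written as a string over {s, f}, where s denotes a self-attention sublayer and f a feedforward sublayer; the interleaved baseline is then sfsf...sf. The sketch below (helper names are illustrative, not from the paper) generates random orderings with the same sublayer counts as the baseline, which is how the paper's random-permutation experiments keep parameter counts fixed.

```python
import random

# 's' = self-attention sublayer, 'f' = feedforward sublayer.
BASELINE_INTERLEAVED = "sf" * 16  # a 16-pair interleaved model

def random_ordering(num_pairs=16, seed=None):
    """Shuffle an equal number of 's' and 'f' sublayers into a random
    order, keeping the sublayer (and hence parameter) count identical
    to the interleaved baseline."""
    rng = random.Random(seed)
    sublayers = list("s" * num_pairs + "f" * num_pairs)
    rng.shuffle(sublayers)
    return "".join(sublayers)

print(BASELINE_INTERLEAVED)
print(random_ordering(seed=0))
```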
Empirical Findings
Through a series of trials with randomly permuted sublayer orderings, Transformer models were tested on the WikiText-103 benchmark. Notably, the analysis revealed that configurations with a higher concentration of self-attention sublayers in the lower levels and feedforward sublayers in the upper levels often outperform the traditional interleaved structure. This insight led to the proposal of the "sandwich Transformer," characterized by a run of consecutive self-attention sublayers at the bottom and a run of consecutive feedforward sublayers at the top, with interleaved pairs in between. Importantly, this reordering requires no additional parameters or training time.
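In the paper's notation, a sandwich Transformer with n sublayer pairs and sandwich coefficient k stacks k self-attention sublayers at the bottom, n - k interleaved pairs in the middle, and k feedforward sublayers at the top. A minimal sketch of this construction (the function name is illustrative):

```python
def sandwich_ordering(num_pairs, k):
    """Build a sandwich ordering: k self-attention sublayers at the bottom,
    k feedforward sublayers at the top, and interleaved 's'/'f' pairs in
    between. The total sublayer count matches an interleaved baseline of
    the same size, so no parameters are added."""
    assert 0 <= k <= num_pairs
    return "s" * k + "sf" * (num_pairs - k) + "f" * k

# k = 0 recovers the interleaved baseline; larger k pushes more attention
# toward the bottom and more feedforward computation toward the top.
print(sandwich_ordering(16, 0))  # 'sfsf...sf'
print(sandwich_ordering(16, 6))  # 'ssssss' + 'sfsf...sf' + 'ffffff'
```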
Quantitatively, the experiments show that several reordered models surpassed the standard interleaved model, as measured by perplexity, the standard evaluation metric in language modeling. Specifically, the sandwich Transformer achieved lower perplexity across multiple datasets, including WikiText-103, enwik8, and an additional book corpus. Although the rearrangement improved language modeling performance, it did not universally enhance other task types, such as machine translation, signaling the need for task-specific sublayer configurations.
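For reference, the perplexity of a model over a held-out sequence of N tokens is the exponentiated average negative log-likelihood, so lower values indicate a better fit to the data:

```latex
\mathrm{PPL}(x_1, \dots, x_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right) \right)
```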
Theoretical and Practical Implications
The paper’s results suggest that the traditional Transformer sublayer ordering might not be the most effective for language modeling and that rearranging these sublayers can provide a performance boost. From a theoretical standpoint, this invites further exploration into the roles of attention and feedforward sublayers, their interactions, and their optimal arrangement for various tasks.
Practically, the proposed sandwich Transformer offers an architecture modification with potential benefits in training efficiency, given that no additional parameters or computational resources are required. It emphasizes the importance of architectural innovation, either through manual design or automated architecture search, to maximize the efficacy of deep learning models.
Future Prospects
The authors propose several avenues for future research. First, exploring reorderings tailored to specific domains such as translation, question answering, and other language modeling settings could yield further performance improvements. Additionally, integrating sublayer ordering into neural architecture search could automate the discovery of optimal arrangements, leading to broader usability across AI applications.
In conclusion, "Improving Transformer Models by Reordering their Sublayers" provides insightful evaluation and experimentation that could influence future work in Transformer design and optimization, showcasing the potential of strategic architectural modifications to achieve enhanced performance in NLP applications.