Universal Transformers
The paper "Universal Transformers" introduces a new model architecture aimed at addressing some of the limitations of both Recurrent Neural Networks (RNNs) and the Transformer model in sequence modeling tasks. The Universal Transformer (UT) is proposed as a generalization of the Transformer, incorporating recurrence in depth while leveraging the strengths of self-attention mechanisms.
Model Architecture and Concept
The Universal Transformer architecture is built upon the concepts of iterative refinement and dynamic computation. At its core, the UT combines the parallel-in-time processing capabilities of the Transformer with the recurrent update mechanism traditionally associated with RNNs. Specifically, the UT applies self-attention at each step within a recurrent structure, iteratively refining the representation of each symbol in the input sequence across multiple steps.
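To make the recurrence-in-depth concrete, the following is a minimal PyTorch sketch of a UT-style encoder: one shared self-attention plus feed-forward block applied repeatedly, with position and per-step embeddings added before each application. The class name, dimensions, fixed step count, and the ReLU feed-forward are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of Universal Transformer-style recurrent refinement
# (assumed hyperparameters; not the paper's exact configuration).
import torch
import torch.nn as nn


class UTEncoderSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=8, d_ff=512, max_steps=6, max_len=512):
        super().__init__()
        self.max_steps = max_steps
        # A single shared block: the same parameters are reused at every depth step.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Position and per-step ("time") embeddings, added before every step.
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.step_emb = nn.Embedding(max_steps, d_model)

    def forward(self, h):                              # h: (batch, seq_len, d_model)
        positions = torch.arange(h.size(1), device=h.device)
        for t in range(self.max_steps):                # recurrence over depth, not over time
            h = h + self.pos_emb(positions) + self.step_emb(torch.tensor(t, device=h.device))
            attn_out, _ = self.self_attn(h, h, h)      # every symbol attends to the whole sequence
            h = self.norm1(h + attn_out)
            h = self.norm2(h + self.ffn(h))
        return h


# Example: refine a batch of 2 sequences of length 10.
enc = UTEncoderSketch()
out = enc(torch.randn(2, 10, 128))                     # -> shape (2, 10, 128)
```

The key difference from a standard Transformer is that depth comes from reapplying one parameter-shared block rather than stacking distinct layers, which is what allows the number of refinement steps to vary.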
Key characteristics of the Universal Transformer include:
- Self-Attention Mechanism: Like the standard Transformer, the UT employs a self-attention mechanism to capture dependencies across the sequence. This allows each symbol's representation to be updated based on the entire sequence's context.
- Recurrent Processing: Unlike the standard Transformer, which stacks a fixed number of layers with separate parameters, the UT applies the same transition function recurrently over depth. The number of refinement steps is therefore not tied to a fixed layer count and can, in principle, be unbounded.
- Dynamic Halting: To optimize computation, the UT incorporates a dynamic per-position halting mechanism based on Adaptive Computation Time (ACT). This allows the model to determine the required number of recurrent steps for each symbol independently.
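The per-position halting idea can be sketched in the same PyTorch style. This follows the common ACT recipe (accumulate a halting probability per position, give the last active step the leftover "remainder" weight, and stop once every position has halted); the function and argument names, the threshold value, and the toy transition function in the usage example are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch of per-position dynamic halting in the spirit of Adaptive
# Computation Time (ACT); constants and names are illustrative assumptions.
import torch
import torch.nn as nn


def act_refine(h, step_fn, halt_proj, max_steps=10, threshold=0.99):
    """h: (batch, seq_len, d_model); step_fn: one application of the shared transition block."""
    batch, seq_len, _ = h.shape
    halting_prob = torch.zeros(batch, seq_len, device=h.device)   # cumulative halt probability
    remainders = torch.zeros(batch, seq_len, device=h.device)
    n_updates = torch.zeros(batch, seq_len, device=h.device)      # steps actually used per position
    accumulated = torch.zeros_like(h)                             # weighted sum of intermediate states

    for _ in range(max_steps):
        still_running = (halting_prob < threshold).float()        # 1 = position has not halted yet
        p = torch.sigmoid(halt_proj(h)).squeeze(-1)               # per-position halting probability
        # Positions that would cross the threshold this step halt now and use their remainder weight.
        new_halted = (halting_prob + p * still_running > threshold).float() * still_running
        still_running = (halting_prob + p * still_running <= threshold).float() * still_running
        halting_prob = halting_prob + p * still_running
        remainders = remainders + new_halted * (1.0 - halting_prob)
        update_weights = (p * still_running + new_halted * remainders).unsqueeze(-1)
        n_updates = n_updates + still_running + new_halted

        h = step_fn(h)                                            # one shared refinement step
        accumulated = accumulated + update_weights * h            # halted positions get zero weight
        if still_running.sum() == 0:                              # every position has halted
            break
    return accumulated, n_updates


# Toy usage: a shared linear "transition" stands in for the UT's attention + feed-forward block.
d_model = 128
step = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
halt = nn.Linear(d_model, 1)
out, n_updates = act_refine(torch.randn(2, 10, d_model), step, halt)
```

In effect, "easy" positions stop accumulating updates after a few steps while "hard" positions keep being refined, which is how the UT adapts its processing depth per symbol.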
Performance on Algorithmic and Language Understanding Tasks
The UT's efficacy is demonstrated across several algorithmic and language understanding tasks, where it outperforms both the standard Transformer and LSTM architectures. Key results include:
- Algorithmic Tasks: On tasks such as Copy, Reverse, and Addition, the UT achieves significantly higher accuracy, effectively generalizing to longer sequences beyond the training data.
- bAbI Question-Answering: The UT demonstrates superior performance on the bAbI dataset, achieving state-of-the-art results across multiple tasks in both the 10K and 1K training-example regimes.
- Subject-Verb Agreement: The UT excels in the subject-verb agreement task, particularly as the complexity increases with the number of agreement attractors, showcasing its proficiency in capturing hierarchical linguistic dependencies.
- LAMBADA Language Modeling: On the challenging LAMBADA dataset, the UT sets a new state of the art, highlighting its ability to incorporate broader discourse context in language modeling.
- Machine Translation: The UT delivers improvements in machine translation tasks, with a notable 0.9 BLEU score increase over the standard Transformer on the WMT14 En-De dataset.
Implications and Future Directions
The Universal Transformer model bridges the gap between practical sequence models like the Transformer and computationally universal models such as Neural Turing Machines. The incorporation of recurrent processing in depth and dynamic computation gives the UT greater computational expressivity while retaining the parallel-in-time structure of the Transformer.
Theoretical Implications: The UT's theoretical foundation is strengthened by its Turing completeness under certain conditions. This implies that the model can simulate any computation given sufficient memory and time, broadening its applicability in complex sequence modeling tasks.
Practical Applications: The demonstrated improvements across a range of tasks indicate that the UT can be effectively deployed in both small-scale structured tasks and large-scale natural language processing applications. The dynamic halting mechanism also promises computational efficiency by adapting the processing depth based on input complexity.
Future Directions:
- Optimization: Further research could explore optimizing the dynamic halting mechanism for even greater computational efficiency.
- Complexity Handling: Enhancing the UT's handling of highly complex sequences with varying lengths and structures could broaden its applicability in real-world tasks.
- Merging with Other Architectures: Integrating the UT's core mechanisms with the strengths of other architectures, such as Graph Neural Networks, could yield models with strong performance across an even wider range of tasks.
In conclusion, the Universal Transformer offers significant advancements in sequence modeling by combining the benefits of self-attention and recurrent processing. Its ability to dynamically adjust computation depth makes it a powerful tool for a wide range of applications, from algorithmic tasks to complex natural language understanding.