
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute (2102.12459v3)

Published 24 Feb 2021 in cs.CL and cs.LG

Abstract: Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.

Citations (45)

Summary

  • The paper presents SRU++, a novel architecture that integrates fast recurrence with minimal attention to drastically reduce training compute.
  • Empirical evaluations show that SRU++ outperforms traditional models on benchmarks like enwik8, Wiki-103, and the Billion Word dataset with significantly lower compute costs.
  • The study challenges full attention dependency by demonstrating that limited attention in recurrent models yields competitive performance and promotes sustainable AI.

An Essay on "When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute"

The paper "When Attention Meets Fast Recurrence: Training LLMs with Reduced Compute" by Tao Lei introduces SRU++, an innovative architecture aiming to enhance computational efficiency in training LLMs. SRU++ is designed to address the increasing computational costs associated with the prevalent Transformer architectures by combining fast recurrence mechanisms with minimal attention layers.

Architectural Overview

SRU++ builds upon the Simple Recurrent Unit (SRU), incorporating self-attention to improve modeling capacity while maintaining computational efficiency. This design strategically leverages fast recurrence to encode sequential patterns efficiently and introduces attention sparingly, thus reducing the dependency on the traditionally heavy computational load of full attention mechanisms.
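To make this layer structure concrete, the following is a minimal PyTorch sketch of an SRU++-style layer. It is an illustration of the idea rather than the authors' implementation (the official `sru` package provides that): the linear projection feeding the recurrence is optionally replaced by a small single-head causal self-attention block, while the recurrence itself remains cheap and elementwise. The class name `SRUppLayerSketch`, the single-head attention, and the dimension choices are assumptions made for clarity.

```python
# Minimal sketch of an SRU++-style layer (illustrative, not the authors' code).
# Idea: the projection that produces the recurrence inputs is optionally replaced
# by a small single-head self-attention block; the recurrence stays elementwise.
import torch
import torch.nn as nn


class SRUppLayerSketch(nn.Module):
    def __init__(self, d_model: int, d_attn: int = 0):
        super().__init__()
        self.use_attention = d_attn > 0
        if self.use_attention:
            # Attention-augmented projection: down-project, attend causally,
            # then up-project to the 3 * d_model values the recurrence consumes.
            self.q_proj = nn.Linear(d_model, d_attn, bias=False)
            self.kv_proj = nn.Linear(d_attn, 2 * d_attn, bias=False)
            self.out_proj = nn.Linear(d_attn, 3 * d_model, bias=False)
        else:
            self.in_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        # Per-dimension parameters for the forget and reset gates.
        self.v = nn.Parameter(torch.randn(2, d_model) * 0.1)
        self.bias = nn.Parameter(torch.zeros(2 * d_model))

    def forward(self, x):  # x: (seq_len, batch, d_model)
        if self.use_attention:
            q = self.q_proj(x)
            k, v = self.kv_proj(q).chunk(2, dim=-1)
            scores = torch.einsum('ibd,jbd->bij', q, k) / (q.size(-1) ** 0.5)
            # Causal mask: each position attends only to itself and the past.
            L = x.size(0)
            mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
            attn = torch.einsum('bij,jbd->ibd',
                                scores.masked_fill(mask, float('-inf')).softmax(-1), v)
            u = self.out_proj(q + attn)          # residual around attention
        else:
            u = self.in_proj(x)
        w, wf, wr = u.chunk(3, dim=-1)
        bf, br = self.bias.chunk(2, dim=-1)
        vf, vr = self.v[0], self.v[1]

        c = torch.zeros_like(x[0])
        outputs = []
        for t in range(x.size(0)):               # fast elementwise recurrence
            f = torch.sigmoid(wf[t] + vf * c + bf)
            c = f * c + (1.0 - f) * w[t]
            r = torch.sigmoid(wr[t] + vr * c + br)
            outputs.append(r * c + (1.0 - r) * x[t])  # highway/skip connection
        return torch.stack(outputs)
```

The key point the sketch illustrates is that attention appears only inside the (optional, low-dimensional) projection, so a layer without attention reduces to a plain fast recurrence with gates and a highway connection.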

Empirical Evaluation

The effectiveness of SRU++ is demonstrated across several standard benchmarks: Enwik8, Wiki-103, and the Billion Word dataset, all of which are established language modeling tasks. The model improves bits-per-character (BPC) and perplexity, achieving these results with 3x-10x less computational cost than existing top-performing Transformer models. Notably, on the Enwik8 dataset, SRU++ achieves a state-of-the-art result with significantly reduced training resources.

A key insight from the paper is that SRU++ requires minimal attention to approach state-of-the-art performance. The experiments reveal that placing a couple of attention layers in the upper layers of the model suffices for capturing long-range dependencies, enabling further computational savings during both training and inference.
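Assuming the hypothetical `SRUppLayerSketch` class from the sketch above, the following shows how such layers might be stacked so that only the top few use attention. The specific configuration (ten layers with attention in the top two) is illustrative, not the paper's prescribed setting; the point is simply that most layers can omit attention entirely.

```python
# Illustrative stacking: attention only in the top two of ten layers, reusing the
# hypothetical SRUppLayerSketch defined above.
import torch.nn as nn


def build_sru_pp_stack(d_model=512, d_attn=128, n_layers=10, n_attn_layers=2):
    layers = []
    for i in range(n_layers):
        use_attn = i >= n_layers - n_attn_layers  # attention only near the top
        layers.append(SRUppLayerSketch(d_model, d_attn if use_attn else 0))
    return nn.ModuleList(layers)
```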

Theoretical and Practical Implications

The introduction of SRU++ has both theoretical and practical implications. Theoretically, it revisits the role of attention in sequence modeling, suggesting that attention might not be the sole requirement for high-capacity models. By demonstrating the efficacy of combining recurrence with attention, the paper challenges the orthodoxy of attention-only architectures like the Transformer.

Practically, SRU++ provides a pathway toward more environmentally sustainable AI, given the substantial reduction in computational resources. This is particularly pertinent as the AI research community grapples with the ecological impact of extensive model training regimes. Moreover, SRU++ can accelerate inference speeds, offering tangible benefits in applications where latency is critical.

Future Directions

The paper opens several avenues for future work. Enhancements to the SRU++ architecture can be explored through more advanced attention mechanisms or improved recurrent implementations. Additionally, the model can incorporate newly proposed optimization techniques to further streamline computation. Furthermore, SRU++ can be adapted and evaluated across a broader range of NLP tasks beyond LLMing, such as machine translation and dialogue systems.

In conclusion, this work presents a compelling alternative to traditional attention-heavy architectures, showcasing that combining fast recurrence with minimal attention can yield competitive, if not superior, results at a fraction of the computational cost. This approach not only advances the efficiency of language models but also contributes to the broader effort to create sustainable AI technologies.
