Finetuning Pretrained Transformers into RNNs (2103.13076v2)

Published 24 Mar 2021 in cs.CL

Abstract: Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has lower training cost relative to training these recurrent variants from scratch. As many models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.

Finetuning Pretrained Transformers into RNNs: Enhancing Efficiency in Autoregressive Generation

This paper presents Transformer-to-RNN (T2R), a method that converts a pretrained transformer model into a recurrent neural network (RNN) to achieve efficient inference in autoregressive generation tasks such as language modeling and machine translation. This is accomplished through a swap-then-finetune procedure: the standard softmax attention in the transformer, whose cost grows quadratically with sequence length, is replaced with a linear-complexity recurrent alternative, and the parameters of the modified model are then finetuned.
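
As a rough illustration of the swap step, the following is a minimal PyTorch-style sketch that recursively replaces softmax attention sub-modules in a pretrained model with a linear-complexity variant before finetuning. The factory function make_linear_attn and the use of nn.MultiheadAttention as the target class are illustrative assumptions; real pretrained codebases define their own attention modules.

```python
import torch.nn as nn

def swap_softmax_attention(module: nn.Module, make_linear_attn):
    """Recursively replace softmax attention sub-modules with a linear-complexity
    variant.  `make_linear_attn` is a hypothetical factory that builds the new
    module and copies over the pretrained query/key/value/output projections."""
    for name, child in module.named_children():
        if isinstance(child, nn.MultiheadAttention):
            setattr(module, name, make_linear_attn(child))
        else:
            swap_softmax_attention(child, make_linear_attn)
    return module
```

After the swap, the entire model is finetuned end-to-end on the downstream objective (for example, language modeling) rather than being retrained from scratch.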

Key Insights and Contributions

The transformer architecture is well recognized for its success across NLP tasks, owing to an attention mechanism that lets every position interact with every other position in the input. However, this comes at a significant computational cost, since the attention computation is quadratic in sequence length. T2R addresses this inefficiency by converting transformers into RNNs, which maintain a fixed-size recurrent state and thereby substantially reduce both the time and space complexity of generation.
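
Concretely, the fixed-size state arises from the standard linear (recurrent) attention formulation that such variants build on. Schematically, writing $\phi$ for the feature map and omitting the $1/\sqrt{d}$ scaling and multi-head structure:

$$
\operatorname{attn}(q_t, K_{\le t}, V_{\le t})
= \sum_{i \le t} \frac{\exp(q_t \cdot k_i)}{\sum_{j \le t} \exp(q_t \cdot k_j)}\, v_i
\;\approx\; \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t},
\qquad
S_t = S_{t-1} + \phi(k_t)\, v_t^\top,
\quad
z_t = z_{t-1} + \phi(k_t),
$$

so during decoding the model only carries $S_t \in \mathbb{R}^{F \times D}$ and $z_t \in \mathbb{R}^{F}$, whose sizes are independent of how many tokens have been generated.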

  1. Transformation Methodology: The central innovation is the modification of the attention computation. The authors replace the conventional softmax attention, whose cost scales quadratically with sequence length, with a linear-complexity alternative that applies a learned single-layer MLP feature map to queries and keys and computes similarity by dot product; the modified model is then finetuned (see the code sketch after this list). This replacement reduces computational cost while largely preserving model accuracy.
  2. Efficiency Gains: Experiments show that T2R substantially reduces the GPU time needed to obtain an efficient model relative to training an RNN variant from scratch. On WikiText-103 language modeling, finetuning a pretrained transformer with T2R yields a validation perplexity of 19.6, compared with 20.8 for a similar model trained from scratch. Inference speed and memory also improve: the T2R model achieves a 15% speedup over previously proposed efficient transformer variants.
  3. Maintained Accuracy: While efficiency improves, retaining accuracy remains critical on standard benchmarks for language modeling and machine translation. For instance, test results on the WMT14 EN-DE and EN-FR tasks show only marginally lower BLEU scores for T2R than for the original large-sized transformers, and T2R can match their accuracy when some layers are retained as standard transformer layers.
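
To make the learned feature map concrete, below is a minimal PyTorch-style sketch of a single-layer MLP feature map and one recurrent decoding step with a fixed-size state. Module and parameter names (MLPFeatureMap, feat_dim, recurrent_attention_step) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MLPFeatureMap(nn.Module):
    """Learned single-layer MLP feature map phi; the ReLU keeps the resulting
    attention weights non-negative.  Names and sizes here are illustrative."""
    def __init__(self, head_dim: int, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

def recurrent_attention_step(q_t, k_t, v_t, state, phi, eps=1e-6):
    """One autoregressive decoding step per attention head.

    q_t, k_t, v_t: (batch, head_dim) projections for the current position.
    state: (S, z) with S of shape (batch, feat_dim, head_dim) and z of shape
    (batch, feat_dim); its size does not grow as more tokens are generated.
    """
    S, z = state
    q_f, k_f = phi(q_t), phi(k_t)                    # (batch, feat_dim)
    S = S + torch.einsum('bf,bd->bfd', k_f, v_t)     # accumulate phi(k_t) v_t^T
    z = z + k_f                                      # accumulate phi(k_t)
    num = torch.einsum('bf,bfd->bd', q_f, S)         # phi(q_t)^T S_t
    den = torch.einsum('bf,bf->b', q_f, z).clamp_min(eps).unsqueeze(-1)
    return num / den, (S, z)

# Usage (shapes only): start from a zero state and feed one token at a time.
phi = MLPFeatureMap(head_dim=64, feat_dim=32)
state = (torch.zeros(1, 32, 64), torch.zeros(1, 32))
q = k = v = torch.randn(1, 64)
out, state = recurrent_attention_step(q, k, v, state, phi)
```

Because the state has a fixed size, per-token decoding cost and memory stay constant regardless of how long the generated sequence grows.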

Implications and Future Directions

The development of T2R demonstrates a practical approach to leveraging the benefits of pretrained models while optimizing for inference efficiency, particularly in settings where computational resources or real-time processing demands are significant constraints. This is especially relevant given the increasing prevalence of large-scale pretrained models in NLP applications.

  1. Pretraining Benefits: The results bolster the proposition that large-scale pretrained models can be effectively reused in efficient model variants, thereby saving the resources typically consumed during the training of large architectures.
  2. Attention Mechanism Exploration: The implications of T2R extend to the attention mechanisms, encouraging exploration of optimized attention schemes that strike a balance between computational costs and model performance, potentially involving dynamic or adaptive attention sparsification strategies.
  3. Broader Applicability: Beyond the scope of autoregressive generation, the methodology can potentially be adapted to other sequential tasks or even non-sequential transformer applications, pending further research into cross-domain generalization.

In summary, this paper presents a practical technique for improving the inference efficiency of transformer models in autoregressive tasks without substantially compromising their performance, and it opens avenues for further research into efficient model training and deployment as computational demands continue to grow.

Authors (9)
  1. Jungo Kasai (38 papers)
  2. Hao Peng (291 papers)
  3. Yizhe Zhang (127 papers)
  4. Dani Yogatama (49 papers)
  5. Gabriel Ilharco (26 papers)
  6. Nikolaos Pappas (188 papers)
  7. Yi Mao (78 papers)
  8. Weizhu Chen (128 papers)
  9. Noah A. Smith (224 papers)
Citations (51)