- The paper presents an innovative architecture that progressively compresses input sequences to eliminate redundancy.
- The methodology reinvests saved computational resources by deepening or widening the network, boosting performance.
- Empirical results show significant pretraining and finetuning speedups while matching or exceeding comparably sized Transformers, with a lightweight decoder preserving the ability to make token-level predictions.
Funnel-Transformer: Efficient Language Processing via Sequential Redundancy Reduction
The paper "Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing" by Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V. Le introduces an innovative architecture for LLMs, the Funnel-Transformer (F-TFM). This model is designed to enhance computational efficiency by addressing the often-ignored sequential redundancy in existing Transformer architectures. The Funnel-Transformer achieves efficiency gains by compressing the sequence length progressively and reallocating computational resources to increase model depth or width, resulting in improved model capacity without additional computational burden.
Key Contributions
The Funnel-Transformer offers several novel contributions to the landscape of language processing models:
- Progressive Sequence Length Compression: The architecture reduces the sequence length progressively across blocks of layers using a pooling mechanism. Because the per-layer cost grows super-linearly with sequence length (self-attention is quadratic in it), each halving of the sequence cuts compute substantially, as illustrated in the sketch following this list.
- Reinvestment in Model Capacity: The compute saved through sequence compression is reinvested in the model's capacity, either by deepening or widening the network. This trades full-length resolution for capacity at roughly constant cost, improving performance on a variety of language tasks.
- Adaptive Decoder for Token-Level Predictions: Despite reducing the sequence length, the model retains the ability to perform token-level predictions. This is achieved through a lightweight decoder that reconstructs token-level hidden representations from the compressed sequence, as required for pretraining objectives such as masked language modeling (MLM).
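To make these three ideas concrete, here is a minimal PyTorch-style sketch of a funnel encoder paired with a token-level decoder. It is an illustration under simplifying assumptions, not the authors' implementation: the paper pools only the attention queries (keys and values still attend over the un-pooled sequence), treats the [CLS] token and attention masks separately, and uses specific block layouts; the class names, layer counts, and dimensions below are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FunnelEncoderSketch(nn.Module):
    """Run the first block at full length, then halve the sequence with
    strided mean pooling before each subsequent block of Transformer layers."""

    def __init__(self, d_model: int, n_heads: int, layers_per_block: int = 2, n_blocks: int = 3):
        super().__init__()
        def stack() -> nn.TransformerEncoder:
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, layers_per_block)
        self.blocks = nn.ModuleList([stack() for _ in range(n_blocks)])

    def forward(self, h: torch.Tensor):
        h = self.blocks[0](h)            # first block keeps the full sequence
        full_length_ref = h              # saved for the decoder's residual connection
        for block in self.blocks[1:]:
            # [batch, seq, d] -> [batch, seq // 2, d] via strided mean pooling
            h = F.avg_pool1d(h.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
            h = block(h)
        return h, full_length_ref


class FunnelDecoderSketch(nn.Module):
    """Recover token-level hidden states (needed for MLM-style pretraining or
    span extraction): upsample the compressed sequence by repetition, add the
    full-length states saved from the first block, and refine briefly."""

    def __init__(self, d_model: int, n_heads: int, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, compressed: torch.Tensor, full_length_ref: torch.Tensor) -> torch.Tensor:
        ratio = full_length_ref.size(1) // compressed.size(1)
        upsampled = compressed.repeat_interleave(ratio, dim=1)   # nearest-neighbour upsampling
        return self.layers(upsampled + full_length_ref)


# Toy usage: 512 embedded tokens are compressed to 128 positions by the
# encoder, and the decoder restores 512 token-level states for an MLM head.
d_model, n_heads = 768, 12
encoder = FunnelEncoderSketch(d_model, n_heads)
decoder = FunnelDecoderSketch(d_model, n_heads)

embedded = torch.randn(2, 512, d_model)          # stand-in for token embeddings
compressed, full_ref = encoder(embedded)         # compressed: [2, 128, 768]
token_states = decoder(compressed, full_ref)     # token_states: [2, 512, 768]
print(compressed.shape, token_states.shape)
```

For sequence-level tasks, the decoder is discarded at finetuning time and only the compressed encoder output is used, which is where much of the finetuning speedup comes from.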
Empirical Evaluation
The Funnel-Transformer is evaluated extensively under two pretraining settings—Base scale and Large scale—and compared against standard Transformer baselines. The empirical results demonstrate several key findings:
- Efficiency Gains Across Benchmarks: When benchmarked against standard Transformer models across multiple tasks, including GLUE, text classification, and reading comprehension, Funnel-Transformer consistently outperforms comparably scaled models in both accuracy and computational efficiency.
- Pretraining and Finetuning Speed: Thanks to the reduced sequence length, the model yields significant pretraining and finetuning speedups on both GPUs and TPUs, translating to shorter training times without compromising performance relative to conventional Transformers.
- Broad Applicability: The paper highlights Funnel-Transformer's effectiveness across a range of applications, with improved results over comparable state-of-the-art models on sequence-level tasks such as text classification and GLUE, as well as on token-level tasks such as SQuAD-style question answering, where answer spans must be located in the original sequence.
Theoretical and Practical Implications
The Funnel-Transformer presents both theoretical and practical advancements:
- Theoretical Insights: The model offers insights into the balance between sequential representation granularity and model capacity. By effectively reducing redundancy, the architecture challenges the necessity of maintaining full-length token representations across all layers for sequence-level tasks.
- Practical Implications: In practice, the Funnel-Transformer provides a viable route to make large-scale LLMs more deployable across devices with limited computational resources. This makes it especially relevant for real-world applications where computational efficiency is as critical as accuracy.
Future Directions
The paper opens several avenues for future research, including exploring optimized block layouts to further balance depth and resource savings, integrating with model compression techniques like distillation and quantization, and extending the architecture to sequence-to-sequence tasks such as translation or summarization. Additionally, continued research can investigate alternative pooling strategies and decoder designs to enhance performance in token-level tasks further.
In summary, the Funnel-Transformer sets a valuable precedent for future state-of-the-art models by showing how thoughtful architectural changes can significantly improve both the efficiency and the effectiveness of language processing systems.