Big Bird: Transformers for Longer Sequences
The paper "Big Bird: Transformers for Longer Sequences" addresses a core limitation of contemporary transformer-based models such as BERT: the quadratic dependency (chiefly in memory) on sequence length that results from the full attention mechanism. BigBird, as proposed by the authors, introduces a sparse attention mechanism that reduces this quadratic dependency to linear, making transformers far more efficient at handling long sequences.
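As a rough back-of-the-envelope illustration of this scaling difference, the sketch below counts attention-score entries per layer for full attention versus a BigBird-style sparse pattern. The window, global, and random sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Back-of-the-envelope comparison of attention-score entries per layer.
# The window/global/random sizes below are illustrative assumptions,
# not the paper's exact configuration.
seq_len = 4096
window, num_global, num_random = 192, 128, 192   # keys seen per query token

full_entries = seq_len * seq_len                                   # O(n^2)
sparse_entries = seq_len * (2 * window + num_global + num_random)  # O(n)

print(f"full attention:   {full_entries:,}")    # 16,777,216
print(f"sparse attention: {sparse_entries:,}")  # 2,883,584
```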
Theoretical Foundation and Model Architecture
BigBird is shown to be a universal approximator of sequence functions and to be Turing complete, preserving these key properties of the quadratic full-attention model. The theoretical analysis in the paper highlights the benefit of including global tokens, such as the CLS token, that attend to the entire sequence, a strategy integrated into BigBird's sparse attention mechanism. This sparse attention pattern allows transformers to scale to much longer inputs without giving up expressive power.
In particular, BigBird's architecture consists of three main components:
- Global Tokens: A set of global tokens that attend to all parts of the sequence and are, in turn, attended to by every token.
- Local Attention: All tokens attend to a set of local neighboring tokens, thereby maintaining local context.
- Random Attention: Each token attends to a set of random tokens to ensure a broader context distribution.
This attention framework, sketched below, allows BigBird to handle sequences up to 8x longer than was previously possible on similar hardware.
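A minimal sketch of how these three patterns combine into a single attention mask is given below. It is a toy illustration under assumed hyperparameters (function name and defaults are hypothetical), not the paper's implementation.

```python
import numpy as np

def bigbird_attention_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Boolean mask where mask[i, j] = True means query token i may attend
    to key token j. Combines BigBird's three sparse patterns (toy version)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Global tokens: the first few tokens (e.g. a CLS-like token) attend
    #    everywhere, and every token attends back to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 2. Local attention: each token sees `window` neighbours on either side.
    for i in range(seq_len):
        mask[i, max(0, i - window):i + window + 1] = True

    # 3. Random attention: each token also attends to a few random tokens.
    for i in range(num_global, seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

# Each row attends to far fewer keys than the full seq_len:
print(bigbird_attention_mask(16).sum(axis=1))
```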
Empirical Performance
The empirical performance of BigBird spans various NLP tasks, demonstrating its advantage in handling longer contexts:
- Masked Language Modeling (MLM): Pretraining on standard datasets with the MLM objective shows that BigBird achieves a better (lower) bits-per-character (BPC) score than BERT and Longformer, indicating improved contextual representations (a toy illustration of the objective follows this list).
- Question Answering (QA): On datasets such as HotpotQA, Natural Questions, TriviaQA, and WikiHop, BigBird outperforms strong baseline models. BigBird-ETC, which uses expanded global tokens, sets new state-of-the-art results, with the largest gains on tasks that require long context.
- Document Classification: On classification tasks such as sentiment analysis and topic assignment, the ability to use longer sequences gives BigBird superior performance, especially when documents are long and training examples are scarce.
- Summarization: BigBird shows substantial improvements in long document summarization tasks. For datasets like Arxiv, PubMed, BigPatent, and others, BigBird's ability to process longer contexts results in higher ROUGE scores, validating its efficacy for encoder-decoder tasks.
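The toy sketch below illustrates the masked-language-modeling objective referenced above: hide a fraction of the input tokens and train the model to reconstruct them. It follows the generic BERT-style recipe under assumed settings, not BigBird's actual pretraining code.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Toy illustration of the MLM objective: corrupt some tokens and
    record the originals as prediction targets (simplified; real recipes
    also use random/kept replacements)."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)   # model must predict the original
            targets.append(tok)
        else:
            corrupted.append(tok)          # unmasked positions are not scored
            targets.append(None)
    return corrupted, targets

corrupted, targets = mask_tokens("big bird handles much longer input sequences".split())
print(corrupted)
print(targets)
```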
Applications to Genomics
Moving beyond NLP, the paper also ventures into genomics, where longer contextual sequences are beneficial. BigBird is pretrained on DNA sequences and then evaluated on predicting chromatin profiles and identifying promoter regions in genomic data.
- Promoter Region Prediction: Fine-tuning a BigBird model pretrained with masked language modeling on DNA sequences yields a substantial improvement in F1 score for promoter region prediction over previous best methods.
- Chromatin-Profile Prediction: BigBird achieves higher AUC scores for transcription-factor binding, histone marks, and DNase I hypersensitivity (DHS) profiles, further emphasizing the value of longer context and extensive pretraining in genomics tasks.
Implementation and Hardware Efficiency
BigBird's sparse attention is optimized for modern accelerators such as GPUs and TPUs, which favor dense, coalesced operations over scattered memory lookups. By blockifying the attention pattern and packing the resulting lookups into dense tensor multiplications, as sketched below, the model achieves both computational and memory efficiency, which is crucial for handling long sequences on large-scale data.
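A minimal sketch of this "blockify and batch" idea is shown below. It is an assumed simplification covering only the local-window pattern, not the paper's actual kernels: group queries and keys into fixed-size blocks, gather the same number of key blocks for every query block, and compute scores with dense batched matrix multiplications.

```python
import numpy as np

def blockified_local_scores(q, k, block_size=64, window_blocks=1):
    """Toy sketch: compute attention scores only between each query block and
    its neighbouring key blocks, using dense batched matmuls instead of
    per-token sparse lookups. Hypothetical helper, not the paper's kernel."""
    seq_len, dim = q.shape
    nb = seq_len // block_size
    qb = q.reshape(nb, block_size, dim)                    # (nb, b, d)
    kb = k.reshape(nb, block_size, dim)

    # For each query block, gather key blocks in a fixed window (clamped at the
    # edges) so every query block sees the same number of key blocks.
    idx = np.clip(np.arange(nb)[:, None] + np.arange(-window_blocks, window_blocks + 1),
                  0, nb - 1)                               # (nb, 2w+1)
    gathered_k = kb[idx].reshape(nb, -1, dim)              # (nb, (2w+1)*b, d)

    # One dense batched matmul per query block.
    scores = np.einsum('nbd,nkd->nbk', qb, gathered_k) / np.sqrt(dim)
    return scores                                          # (nb, b, (2w+1)*b)

q = np.random.randn(256, 32)
k = np.random.randn(256, 32)
print(blockified_local_scores(q, k).shape)  # (4, 64, 192)
```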
Conclusion and Future Work
The paper concludes by noting the broad applicability and improved performance of BigBird over existing models for both typical NLP tasks and specialized applications in genomics. With BigBird's capability to extend context length efficiently, further research could explore even more diverse domains and applications. Future developments in sparse attention mechanisms like BigBird could shape the next generation of transformer models capable of understanding and generating complex, contextually rich sequences in various fields.