DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Published 2 May 2020 in cs.CL, cs.AI, and cs.LG | (2005.00697v1)

Abstract: Transformer-based QA models use input-wide self-attention -- i.e. across both the question and the input passage -- at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text representations, which in turn enables pre-computing passage representations reducing runtime compute drastically. Furthermore, because DeFormer is largely similar to the original model, we can initialize DeFormer with the pre-training weights of a standard transformer, and directly fine-tune on the target QA dataset. We show DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x and with simple distillation-based losses they incur only a 1% drop in accuracy. We open source the code at https://github.com/StonyBrookNLP/deformer.


Authors (4)

Citations (64)


Summary

Insights and Implications of DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

The paper "DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering" introduces an approach for improving the efficiency of Transformer-based models, such as BERT and XLNet, on question answering (QA) tasks. Its primary contribution is DeFormer, a decomposed transformer that reduces the computational cost of inference and speeds it up without a significant loss in accuracy.

Overview of DeFormer Architecture

DeFormer addresses the computational bottleneck in Transformer QA models caused by input-wide self-attention, which spans both the question and the passage at every layer. It replaces full self-attention with question-wide and passage-wide self-attention in the lower layers. As a result, the lower layers process the question and the passage independently, while the higher layers keep full self-attention over both so that the two can interact.
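One way to picture the decomposition is as a per-layer attention mask: block-diagonal in the lower layers, all ones in the upper layers. The sketch below is only an illustration under that assumption. The function names and the parameter `k` (the number of decomposed lower layers) are hypothetical, and a real implementation would more likely run the two segments separately than mask a joint sequence.

```python
from typing import List

Mask = List[List[int]]


def attention_mask(q_len: int, p_len: int, decomposed: bool) -> Mask:
    """Return an n x n 0/1 mask; mask[i][j] == 1 means token i may attend to token j.

    Tokens 0..q_len-1 are the question, the remaining p_len tokens are the passage.
    """
    n = q_len + p_len
    if not decomposed:
        # Full (input-wide) self-attention: every token sees every token.
        return [[1] * n for _ in range(n)]

    def same_segment(i: int, j: int) -> bool:
        return (i < q_len) == (j < q_len)

    # Decomposed attention: question tokens see only the question,
    # passage tokens see only the passage.
    return [[1 if same_segment(i, j) else 0 for j in range(n)] for i in range(n)]


def layer_masks(q_len: int, p_len: int, num_layers: int, k: int) -> List[Mask]:
    """Lower k layers use the decomposed mask; the remaining layers use full attention."""
    return [
        attention_mask(q_len, p_len, decomposed=(layer < k))
        for layer in range(num_layers)
    ]


# Illustrative sizes only: 2 question tokens, 3 passage tokens, 4 layers, 2 decomposed.
masks = layer_masks(q_len=2, p_len=3, num_layers=4, k=2)
```

In this toy setup, `masks[0]` keeps question and passage apart, while `masks[3]` lets every token attend to every other token.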

Because the lower layers no longer depend on the question, passage representations can be pre-computed offline and reused for every question about that passage, which significantly reduces runtime compute. This matters for scaling QA systems to large document collections and for deploying them on resource-constrained devices such as smartphones.
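The offline/online split can be sketched as a small cache. This is a minimal illustration, not the paper's code: `PassageCache`, `answer`, and the toy encoders are hypothetical stand-ins for the lower layers, the upper layers, and a real storage backend.

```python
from typing import Callable, Dict, List

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]


class PassageCache:
    """Stores lower-layer passage representations computed offline."""

    def __init__(self, encode_passage_lower: Encoder) -> None:
        self._encode = encode_passage_lower
        self._store: Dict[str, List[Vector]] = {}

    def precompute(self, passage_id: str, tokens: List[str]) -> None:
        # Offline step: run the lower layers on the passage alone.
        self._store[passage_id] = self._encode(tokens)

    def get(self, passage_id: str) -> List[Vector]:
        return self._store[passage_id]


def answer(
    question: List[str],
    passage_id: str,
    cache: PassageCache,
    encode_question_lower: Encoder,
    upper_layers: Callable[[List[Vector]], float],
) -> float:
    # Online step: only the (short) question goes through the lower layers.
    q_repr = encode_question_lower(question)
    p_repr = cache.get(passage_id)
    # The upper layers see the concatenated sequence and attend across both.
    return upper_layers(q_repr + p_repr)


# Toy stand-ins so the sketch runs; a real model would use transformer layers.
calls = {"passage": 0}


def toy_passage_encoder(tokens: List[str]) -> List[Vector]:
    calls["passage"] += 1
    return [[float(len(t))] for t in tokens]


def toy_question_encoder(tokens: List[str]) -> List[Vector]:
    return [[float(len(t))] for t in tokens]


def toy_upper_layers(reprs: List[Vector]) -> float:
    return sum(v[0] for v in reprs)


cache = PassageCache(toy_passage_encoder)
cache.precompute("doc-1", ["the", "cat", "sat"])
first = answer(["who", "sat"], "doc-1", cache, toy_question_encoder, toy_upper_layers)
second = answer(["what"], "doc-1", cache, toy_question_encoder, toy_upper_layers)
```

Both questions reuse the same cached passage representation, so the passage encoder runs once no matter how many questions arrive.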

Empirical Evaluation

Empirical results indicate that DeFormer achieves substantial efficiency improvements. In particular, DeFormer versions of BERT and XLNet speed up QA by over 4.3x and, when trained with simple distillation-based losses, incur only a 1% drop in accuracy relative to the original models.

The efficiency gains are attributed to:

Reduced FLOPs due to the decomposed architecture.
Lower memory usage, because the lower-layer passage representations are pre-computed and their intermediate activations need not be held in memory at runtime.
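A rough count of query-key pairs shows where the attention savings come from. The sketch below is a back-of-the-envelope estimate, not the paper's FLOP model: it ignores feed-forward layers and projections, and the sequence lengths, layer count, and `k` are hypothetical values chosen for illustration.

```python
def attention_pairs(
    q_len: int, p_len: int, num_layers: int, k: int, passage_cached: bool
) -> int:
    """Count query-key pairs scored at runtime across all layers.

    The lower k layers attend within each segment only; if the passage is
    cached, its lower-layer attention is not computed at runtime at all.
    """
    n = q_len + p_len
    lower_per_layer = q_len ** 2 + (0 if passage_cached else p_len ** 2)
    upper_per_layer = n ** 2
    return k * lower_per_layer + (num_layers - k) * upper_per_layer


# Illustrative numbers only: 20 question tokens, 300 passage tokens, 12 layers.
full = attention_pairs(q_len=20, p_len=300, num_layers=12, k=0, passage_cached=False)
decomposed = attention_pairs(q_len=20, p_len=300, num_layers=12, k=9, passage_cached=True)
ratio = full / decomposed
```

With these made-up sizes, `full` is 1,228,800 pairs and `decomposed` is 310,800, roughly a fourfold reduction in attention pairs. That ratio is not a prediction of end-to-end speedup.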

Notably, the DeFormer version of BERT-large ran faster than the original BERT-base while also being more accurate, showing that the speedup can be spent on a larger, more accurate model.

Auxiliary Supervision

To offset the accuracy lost to decomposition, the paper adds auxiliary supervision: a knowledge distillation loss and a layer-wise representation similarity loss. Both encourage DeFormer to match the predictions and the internal representations of the original, non-decomposed model. With these losses, DeFormer recovers more of the original model's accuracy.
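The two auxiliary losses can be sketched as follows. The summary does not give their exact form, so this example assumes a common choice: KL divergence between teacher and student output distributions for distillation, and summed Euclidean distance between per-token representations for layer-wise similarity. The weights `alpha` and `beta` are hypothetical hyperparameters.

```python
import math
from typing import List, Sequence

Layer = List[List[float]]  # one vector per token


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float], student_logits: Sequence[float]
) -> float:
    """KL(teacher || student) over the output distributions (assumed form)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def layerwise_similarity_loss(
    teacher_layers: List[Layer], student_layers: List[Layer]
) -> float:
    """Sum of Euclidean distances between matching token vectors (assumed form)."""
    total = 0.0
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        for t_vec, s_vec in zip(t_layer, s_layer):
            total += math.sqrt(sum((t - s) ** 2 for t, s in zip(t_vec, s_vec)))
    return total


def total_loss(
    task_loss: float, kd: float, lrs: float, alpha: float = 1.0, beta: float = 1.0
) -> float:
    # Weighted sum of the task loss and the two auxiliary terms.
    return task_loss + alpha * kd + beta * lrs


# Toy values: the student disagrees with the teacher on both outputs and features.
kd = distillation_loss([2.0, 0.0], [0.0, 2.0])
lrs = layerwise_similarity_loss([[[0.0, 0.0]]], [[[3.0, 4.0]]])
loss = total_loss(task_loss=0.5, kd=kd, lrs=lrs)
```

Both auxiliary terms are zero when the student matches the teacher exactly, so they penalize only the drift that decomposition introduces.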

Implications and Future Directions

Faster inference at lower computational cost has several implications:

Scalability: Because the per-question cost is lower, models like DeFormer can be deployed where large volumes of passages must be processed.
Mobile Application: The decomposed approach positions QA systems for effective deployment in resource-constrained environments, enhancing privacy by enabling on-device processing.
Broader Applications: Although the focus here is question answering, the decomposition technique may generalize to other NLP tasks characterized by paired-input architectures, suggesting pathways for further research and application.

In future work, DeFormer could be combined with other efficiency techniques, such as head pruning or low-rank factorization, for further gains. The decomposition strategy may also inform the design of future Transformer-based systems that aim for better resource utilization while still reusing existing pre-trained weights rather than repeating pre-training.

Overall, the paper shows that a modest architectural change can make Transformer-based QA substantially faster at a small cost in accuracy, and it offers a starting point for subsequent work on efficient QA systems.
