The paper "DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering" proposes a way to make Transformer-based models such as BERT and XLNet more efficient at question answering (QA). Its primary contribution is DeFormer, a decomposed Transformer that reduces the computation required at inference time and so speeds up inference without a significant loss in accuracy.
DeFormer targets the main computational bottleneck of Transformer models in QA: self-attention over the entire input (question plus passage) at every layer. In the lower layers, it replaces full self-attention with separate question-wide and passage-wide self-attention, so that the question and the context (passage) are processed independently. The higher layers keep full self-attention over both, which integrates the two representations and preserves most of the model's effectiveness.
Because the lower-layer passage representations no longer depend on the question, they can be pre-computed offline, which significantly reduces the computation needed at runtime. This is crucial for scaling QA systems to large datasets and for deploying them on resource-constrained devices such as smartphones.
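The decomposition described above can be illustrated with a minimal sketch. It is a toy model, not the paper's implementation: the attention function uses identity projections and omits residual connections, feed-forward sublayers, and special tokens, and all names (`self_attention`, `precompute_passage`, `deformer_forward`, `k_lower`) are hypothetical. The point is only to show that the passage can be pushed through the lower layers once, cached, and reused for any question.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention with identity projections (assumption)."""
    d = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, tokens))
                    for j in range(d)])
    return out

def run_layers(tokens, n_layers):
    for _ in range(n_layers):
        tokens = self_attention(tokens)
    return tokens

def precompute_passage(passage, k_lower):
    """Offline step: lower layers attend over the passage only."""
    return run_layers(passage, k_lower)

def deformer_forward(question, passage_cache, k_lower, n_layers):
    """Runtime step: lower layers see only the question; upper layers see both."""
    question_lower = run_layers(question, k_lower)
    joint = question_lower + passage_cache  # concatenate the token sequences
    return run_layers(joint, n_layers - k_lower)

# Illustrative sizes: 3 question tokens, 5 passage tokens, 4-dim embeddings.
random.seed(0)
question = [[random.random() for _ in range(4)] for _ in range(3)]
passage = [[random.random() for _ in range(4)] for _ in range(5)]

passage_cache = precompute_passage(passage, k_lower=2)  # done once, offline
output = deformer_forward(question, passage_cache, k_lower=2, n_layers=4)
```

In this sketch the same `passage_cache` can be passed to `deformer_forward` for every new question about that passage, so only the question goes through the lower layers at query time.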
Empirical Evaluation
Empirical results indicate that DeFormer achieves substantial efficiency improvements. In particular, DeFormer versions of BERT and XLNet deliver speedups of up to 4.3x on standard QA tasks while losing only about 1% accuracy relative to the original models.
The efficiency gains are attributed to:
- Reduced FLOPs due to the decomposed architecture.
- Lower memory usage, since the intermediate lower-layer representations of the passage do not need to be held in memory at runtime.
Notably, the DeFormer version of BERT-large runs faster than the original BERT-base model while also being more accurate, showing that the decomposition can improve efficiency without giving up the benefit of a larger model.
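The FLOP reduction listed above can be estimated with a rough cost model. The sketch below is an approximation under stated assumptions, not the paper's measurement: it counts about 12·n·d² multiply-adds per layer for the linear projections and a 4x-wide feed-forward block, plus 2·n²·d for the attention scores and weighted sum, and it assumes the passage's lower-layer output is already cached. The function names and the example sizes are illustrative.

```python
def layer_cost(n_tokens, d_model):
    """Approximate multiply-adds for one Transformer layer (assumed cost model)."""
    linear = 12 * n_tokens * d_model ** 2     # projections + 4x-wide feed-forward
    attention = 2 * n_tokens ** 2 * d_model   # score matrix + weighted sum of values
    return linear + attention

def full_cost(q_len, p_len, n_layers, d_model):
    """Original model: every layer attends over question + passage."""
    return n_layers * layer_cost(q_len + p_len, d_model)

def deformer_runtime_cost(q_len, p_len, n_layers, k_lower, d_model):
    """Decomposed model at runtime: lower layers process only the question."""
    lower = k_lower * layer_cost(q_len, d_model)
    upper = (n_layers - k_lower) * layer_cost(q_len + p_len, d_model)
    return lower + upper

def speedup(q_len, p_len, n_layers, k_lower, d_model):
    return (full_cost(q_len, p_len, n_layers, d_model)
            / deformer_runtime_cost(q_len, p_len, n_layers, k_lower, d_model))

# Illustrative setting (assumed values): 12 layers, hidden size 768,
# a 32-token question, a 320-token passage, 9 decomposed lower layers.
estimate = speedup(q_len=32, p_len=320, n_layers=12, k_lower=9, d_model=768)
print(f"estimated runtime speedup: {estimate:.2f}x")
```

Under this model the estimated speedup grows with the number of decomposed layers and with the ratio of passage length to question length, which matches the intuition that most of the saved work is passage processing. Measured speedups also depend on hardware and implementation, which this count ignores.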
Auxiliary Supervision
To mitigate the accuracy loss caused by decomposition, the paper introduces auxiliary supervision. It adds a knowledge distillation loss and a layer-wise representation similarity loss, which encourage DeFormer to mimic both the output predictions and the upper-layer representations of its original, non-decomposed counterpart. These auxiliary losses are found to help the decomposed model retain more of the original model's representational capabilities.
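A minimal sketch of how such auxiliary losses might be combined is given below. It assumes the distillation term is the KL divergence between the teacher's and student's output distributions and the similarity term is a summed Euclidean distance between corresponding token vectors; the weights `alpha` and `beta` and all function names are hypothetical, and the paper's exact formulation may differ.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the output distribution."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def layerwise_similarity_loss(student_layers, teacher_layers):
    """Sum of Euclidean distances between matching token vectors, layer by layer."""
    total = 0.0
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        for s_tok, t_tok in zip(s_layer, t_layer):
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(s_tok, t_tok)))
    return total

def total_loss(task_loss, kd_loss, lrs_loss, alpha=1.0, beta=1.0):
    """Task loss plus weighted auxiliary terms (alpha, beta are assumed weights)."""
    return task_loss + alpha * kd_loss + beta * lrs_loss

# Toy example: one upper layer, one token, 2-dim representations.
kd = distillation_loss([1.0, 2.0, 0.5], [1.2, 1.9, 0.4])
lrs = layerwise_similarity_loss([[[0.0, 0.0]]], [[[3.0, 4.0]]])
loss = total_loss(task_loss=0.7, kd_loss=kd, lrs_loss=lrs, alpha=0.5, beta=0.1)
```

Both auxiliary terms are zero when the student exactly matches the teacher, so they only penalize the deviation introduced by the decomposition.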
Implications and Future Directions
Faster inference at a lower computational cost has several implications:
- Scalability: The reduced per-query cost makes models like DeFormer practical in settings that must process large volumes of text.
- Mobile Application: The decomposed approach positions QA systems for effective deployment in resource-constrained environments, enhancing privacy by enabling on-device processing.
- Broader Applications: Although the focus here is question answering, the decomposition technique may generalize to other NLP tasks that take paired inputs, which suggests directions for further research and application.
In future work, DeFormer could be combined with other efficiency techniques, such as head pruning or low-rank factorization, for further gains. The decomposition strategy may also motivate a rethinking of architectural design for future Transformer-based systems, aiming for better resource utilization without relying entirely on an extensive pre-training phase.
Overall, the paper shows that a targeted architectural change can make QA systems substantially faster with little loss in accuracy, and it sets a precedent for subsequent work on efficient Transformer architectures.