- The paper introduces a novel Patience-based Early Exit (PABEE) mechanism that halts inference once predictions stabilize across consecutive layers, boosting speed while preserving accuracy.
- Experimental results on the GLUE benchmark show that PABEE enhances the trade-off between inference efficiency and predictive performance in models like ALBERT.
- PABEE is plug-and-play and additionally improves adversarial robustness, making it practical for deploying large language models in resource-limited environments.
Summary of "BERT Loses Patience: Fast and Robust Inference with Early Exit"
The paper "BERT Loses Patience: Fast and Robust Inference with Early Exit" introduces a novel method called Patience-based Early Exit (PABEE) to enhance the inference efficiency of Pretrained LLMs (PLMs) without compromising their accuracy. PLMs, including BERT, ALBERT, and others, are widely utilized within the field of NLP but often suffer from inefficiency due to large parameter sets and extensive computational requirements. This inefficiency is particularly notable during inference, where these models are deployed to understand and process textual input. The authors propose an innovative approach to address this inefficiency by implementing an adaptive mechanism that optimally determines when an inference task can terminate early without significantly affecting predictive accuracy.
Key Contributions
- Patience-based Early Exit Mechanism: The authors couple each layer of a PLM with an internal classifier. PABEE stops inference as soon as the predictions of consecutive internal classifiers stop changing, i.e., remain identical for a predefined number of layers (the patience threshold), as illustrated in the sketch after this list. This also mitigates the overthinking problem, where deeper layers can hurt generalization by fitting spurious, overly specific features.
- Experimental Validation: PABEE is evaluated on the GLUE benchmark with an ALBERT backbone, improving both inference efficiency and model robustness. The results show a better trade-off between accuracy and inference speed than existing early-exit methods.
- Methodological Simplicity: The approach is plug-and-play, requiring only minimal additional parameters and training adjustments, and the patience threshold provides a single knob for trading speed against accuracy under varying resource and computational constraints.
- Adversarial Robustness: PABEE demonstrates superior robustness against adversarial attacks compared to baseline models. This is attributed to its multi-classifier exit strategy: an attacker must fool several consecutive classifiers rather than a single one, mitigating a vulnerability of models that rely solely on the final layer's prediction.
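A minimal PyTorch sketch of the patience criterion at inference time follows. Everything here (the `PABEEModel` class, the per-layer linear `classifiers`, the mean-pooling choice) is an illustrative assumption for this summary rather than the authors' released implementation; the core idea is simply to count consecutive agreements between internal classifiers and stop once the count reaches the patience threshold.

```python
# Minimal sketch of patience-based (PABEE-style) early-exit inference.
# Assumptions: a stack of transformer layers, one linear classifier per
# layer, and batch size 1 so "the prediction" is a single label. All
# names are illustrative, not the paper's released implementation.
import torch
import torch.nn as nn

class PABEEModel(nn.Module):
    def __init__(self, layers: nn.ModuleList, classifiers: nn.ModuleList,
                 patience: int = 3):
        super().__init__()
        assert len(layers) == len(classifiers)
        self.layers = layers            # backbone layers, applied in order
        self.classifiers = classifiers  # one internal classifier per layer
        self.patience = patience        # required consecutive agreements t

    @torch.no_grad()
    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        prev_label, agreements = None, 0
        logits = None
        for layer, clf in zip(self.layers, self.classifiers):
            hidden = layer(hidden)            # [B, T, H] -> [B, T, H]
            logits = clf(hidden.mean(dim=1))  # mean-pool, then classify
            label = logits.argmax(dim=-1)
            # Count consecutive unchanged predictions; a change resets
            # the counter ("losing patience" starts over).
            if prev_label is not None and torch.equal(label, prev_label):
                agreements += 1
            else:
                agreements = 0
            prev_label = label
            if agreements >= self.patience:
                return logits  # prediction has stabilized: exit early
        return logits          # no early exit: use the top layer's output

# Toy usage: 12 small encoder layers, 3-way classification, patience 2.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(12)
)
classifiers = nn.ModuleList(nn.Linear(64, 3) for _ in range(12))
model = PABEEModel(layers, classifiers, patience=2).eval()
print(model(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 3])
```

During training, the internal classifiers are optimized jointly with the backbone by combining the per-layer classification losses into a single objective; only the lightweight classifier heads are added on top of the original model, which is what keeps the parameter and training overhead minimal.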
Theoretical Insights
The paper's theoretical contribution shows that, under reasonable assumptions about the internal classifiers' error rates, PABEE can statistically outperform conventional full-depth inference in accuracy. The gap in computational resources required by standard inference versus PABEE underlines its practical importance in scenarios demanding high efficiency, such as mobile and edge computing. A simplified version of this intuition appears below.
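To make the flavor of this result concrete, here is a deliberately simplified back-of-the-envelope bound, assuming binary labels and internal classifiers that err independently with probability at most $q$; it illustrates the spirit of the analysis, not the paper's exact theorem.

```latex
% Deliberately simplified bound; assumes binary labels and internal
% classifiers that err independently with probability at most q.
% Illustrates the spirit of the analysis, not the exact theorem.
%
% PABEE exits only after t consecutive internal classifiers produce
% the same label. For a fixed window of t classifiers to trigger a
% *wrong* exit, all t must err simultaneously:
\[
  \Pr\bigl[\text{fixed window of } t \text{ classifiers all wrong}\bigr]
  \;\le\; q^{t}.
\]
% A union bound over the n - t + 1 possible windows in an n-layer
% model then gives
\[
  \Pr\bigl[\text{any wrong early exit}\bigr]
  \;\le\; (n - t + 1)\, q^{t},
\]
% which decays exponentially in the patience t: larger t makes
% confidently wrong exits rare, at the cost of fewer early exits
% and therefore less speedup.
```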
Practical Implications and Future Prospects
PABEE's ability to adapt to available resources makes it especially suited to real-world applications where the trade-offs between latency and computational power are paramount. This has implications for deploying PLMs in low-resource environments, broadening their accessibility and usability. Furthermore, while primarily validated on NLP tasks, the method also shows promise in computer vision, as demonstrated by additional experiments with ResNet-style convolutional networks.
Additionally, as the field of artificial intelligence continues to move toward ever larger and deeper models, techniques like PABEE offer a way to maintain efficiency without scaling down architectural capacity, an essential consideration given the trend toward models with billions of parameters.
In conclusion, "BERT Loses Patience" presents an effective approach to balancing the accuracy-speed trade-off in PLMs. Future research could explore additional model architectures and tasks to fully leverage PABEE's benefits across domains, and address open challenges such as extending the patience criterion to multi-branch networks, where comparing consecutive predictions is less straightforward.