
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

Published 7 Feb 2020 in cs.CL and cs.LG (arXiv:2002.02925v4)

Abstract: In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models. Compared to the previous knowledge distillation approaches for BERT compression, our approach does not introduce any additional loss function. Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression.

Citations (189)

Summary

BERT-of-Theseus: A Model Compression Approach without Additional Loss Functions

The paper "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" introduces a novel model compression technique, coined Theseus Compression. The technique compresses BERT, a widely used model in natural language processing, by progressively replacing its modules rather than relying on traditional methods such as knowledge distillation (KD), which require additional loss functions.

Overview

BERT-of-Theseus addresses the computational cost of BERT, whose more than one hundred million parameters make deployment in practical applications resource-intensive. Whereas existing methods such as KD require carefully designed distillation loss functions to train a smaller model to mimic the larger one, Theseus Compression circumvents this by progressively swapping parts of BERT with smaller substitute modules during training. This creates a deeper interaction between the original and compact models, driven by a dynamic replacement schedule akin to curriculum learning, and improves the successor model's performance without any extra supervisory signals.

Methodology

The method utilizes a progressive module replacing strategy inspired by the philosophical paradox "Ship of Theseus." The process involves the following steps:

  1. Module Specification: Each module in the original BERT (predecessor) is assigned a corresponding, smaller successor module.

  2. Progressive Replacement: Throughout training, predecessor modules are stochastically replaced with successor modules with an increasing probability, regulated by a curriculum learning-driven replacement scheduler.

  3. Task-Specific Fine-Tuning: Once the replacement phase converges, the successor modules are combined and fine-tuned jointly with the task-specific loss function (e.g., cross-entropy).

This strategy not only keeps the successor's training behavior close to that of its predecessor, but also introduces a regularization effect similar to Dropout, making the approach robust across a range of tasks.
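The replacement mechanism above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and function names, the toy callables, and the scheduler constants (`p0`, `k`) are assumptions for the sketch. In the paper, each module is a group of Transformer layers, and only the successor modules receive gradient updates during the replacement phase.

```python
import random

def linear_scheduler(step, p0=0.5, k=1e-3):
    """Replacement probability grows linearly with the training step,
    capped at 1.0 (at which point every module is replaced)."""
    return min(1.0, p0 + k * step)

class TheseusEncoder:
    """Mixes predecessor and successor modules at forward time.

    `predecessor_modules` and `successor_modules` are parallel lists of
    callables; in the paper these would be groups of Transformer layers."""

    def __init__(self, predecessor_modules, successor_modules):
        assert len(predecessor_modules) == len(successor_modules)
        self.pred = predecessor_modules
        self.succ = successor_modules
        self.p = 0.0  # current replacement probability, set by the scheduler

    def forward(self, x, rng=random):
        for pred_m, succ_m in zip(self.pred, self.succ):
            # Independent Bernoulli draw per module: use the compact
            # substitute with probability p, else the frozen original.
            if rng.random() < self.p:
                x = succ_m(x)
            else:
                x = pred_m(x)
        return x
```

Because the successor modules are trained inside the predecessor's computation graph, gradients flow through whichever mix of modules was sampled, so no distillation loss is needed; raising `p` to 1.0 recovers the standalone compact model for the final fine-tuning step.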

Experimental Insights

The paper conducts extensive evaluations on the GLUE benchmark, a suite of NLP tasks, comparing BERT-of-Theseus to several KD-based methods. The compressed model retains approximately 98% of the original BERT's performance while running inference roughly twice as fast. Notably, on tasks such as QQP the compressed model even surpasses the original BERT, suggesting a favorable balance between model size and task-specific generalization.

Implications and Future Work

Theseus Compression effectively demonstrates an alternative to KD for model compression, showing promise for its application beyond BERT to other large-scale models, such as those used in computer vision. Its model-agnostic nature and the absence of additional loss functions imply potential applicability across various architectures, including ResNets and Graph Neural Networks.

Future work may explore integrating these strategies with dynamic acceleration methods like early exit mechanisms to further optimize model efficiency. The research also opens avenues for experimenting with in-place substitutes, fostering a versatile framework for compressing and enhancing neural networks in heterogeneous computing environments.

In summary, BERT-of-Theseus proposes a paradigm for model compression that offers efficiency gains without the complexity introduced by additional distillation losses, pointing towards innovative strategies in model optimization and deployment.
