Improving BERT with Hybrid Pooling Network and Drop Mask (2307.07258v1)
Abstract: Transformer-based pre-trained language models, such as BERT, achieve great success on various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, vanilla BERT uses the same self-attention mechanism in every layer to model these different contextual features. In this paper, we propose HybridBERT, a model that combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by the excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training, with lower loss, faster training speed (8% relative), and lower memory cost (13% relative), and in transfer learning, with 1.5% relatively higher accuracy on downstream tasks. Additionally, DropMask improves the accuracy of BERT on downstream tasks across various masking rates.
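The abstract does not spell out how self-attention and the pooling network are combined within a layer, so the following PyTorch sketch is only a rough illustration of the general idea: an encoder layer whose token mixer is either standard multi-head self-attention or a lightweight average-pooling operator, with the choice varying across layers. The class names (`PoolingMixer`, `HybridEncoderLayer`), the pooling kernel size, and the lower-layers-pool / upper-layers-attend split are assumptions made for illustration, not the paper's actual design.

```python
# Minimal, illustrative sketch (NOT the paper's implementation) of a hybrid
# encoder layer: a pooling-based token mixer stands in for self-attention in
# some layers, while other layers keep standard attention.
import torch
import torch.nn as nn


class PoolingMixer(nn.Module):
    """Mixes tokens with a local average pool instead of self-attention."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # Padding keeps the sequence length unchanged.
        self.pool = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) -> pool along the sequence dimension.
        return self.pool(x.transpose(1, 2)).transpose(1, 2)


class HybridEncoderLayer(nn.Module):
    """Transformer-style layer whose token mixer is either attention or pooling."""

    def __init__(self, hidden: int = 768, heads: int = 12, use_pooling: bool = False):
        super().__init__()
        self.use_pooling = use_pooling
        if use_pooling:
            self.mixer = PoolingMixer()
        else:
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mixed = self.mixer(x) if self.use_pooling else self.attn(x, x, x)[0]
        x = self.norm1(x + mixed)                 # residual + post-norm, as in BERT
        return self.norm2(x + self.ffn(x))


if __name__ == "__main__":
    # Assumed split for illustration: lower layers pool, upper layers attend.
    layers = nn.ModuleList(
        [HybridEncoderLayer(use_pooling=(i < 6)) for i in range(12)]
    )
    h = torch.randn(2, 128, 768)                  # (batch, seq_len, hidden)
    for layer in layers:
        h = layer(h)
    print(h.shape)                                # torch.Size([2, 128, 768])
```

A pooling mixer of this kind has no pairwise attention computation, which is consistent with the abstract's reported savings in training time and memory, though the exact mechanism used by HybridBERT may differ.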
- Qian Chen (264 papers)
- Wen Wang (144 papers)
- Qinglin Zhang (30 papers)
- Chong Deng (22 papers)
- Yukun Ma (1 paper)
- Siqi Zheng (61 papers)