Improving BERT with Hybrid Pooling Network and Drop Mask (2307.07258v1)

Published 14 Jul 2023 in cs.CL

Abstract: Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, the vanilla BERT uses the same self-attention mechanism for each layer to model the different contextual features. In this paper, we propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling (MLM) pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), lower memory cost (13% relative), and also in transfer learning with 1.5% relative higher accuracies on downstream tasks. Additionally, DropMask improves accuracies of BERT on downstream tasks across various masking rates.
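
As a rough illustration of the hybrid idea described in the abstract (not the paper's actual architecture), the sketch below combines a self-attention branch with a simple average-pooling branch inside a single encoder layer. The class name `HybridEncoderLayer`, the `pool_window` parameter, and the concatenate-and-project mixing scheme are all assumptions made for illustration.

```python
# Hypothetical sketch of an encoder layer mixing self-attention with a pooling
# branch, loosely following the abstract's description of HybridBERT.
# The mixing scheme and hyperparameters are assumptions, not the paper's design.
import torch
import torch.nn as nn


class HybridEncoderLayer(nn.Module):
    """Encoder layer with a self-attention branch and an average-pooling branch."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12, pool_window: int = 3):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Local average pooling over a fixed window (assumed stand-in for the pooling network).
        self.pool = nn.AvgPool1d(kernel_size=pool_window, stride=1, padding=pool_window // 2)
        self.proj = nn.Linear(2 * hidden_size, hidden_size)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        attn_out, _ = self.self_attn(x, x, x)
        # AvgPool1d pools over the last dim, so pool along the sequence axis.
        pool_out = self.pool(x.transpose(1, 2)).transpose(1, 2)
        # Concatenate both contextual views and project back to hidden_size.
        mixed = self.proj(torch.cat([attn_out, pool_out], dim=-1))
        x = self.norm1(x + mixed)
        x = self.norm2(x + self.ffn(x))
        return x


if __name__ == "__main__":
    layer = HybridEncoderLayer()
    tokens = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
    print(layer(tokens).shape)         # torch.Size([2, 16, 768])
```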

Authors (6)
  1. Qian Chen (264 papers)
  2. Wen Wang (144 papers)
  3. Qinglin Zhang (30 papers)
  4. Chong Deng (22 papers)
  5. Ma Yukun (1 paper)
  6. Siqi Zheng (61 papers)