
TrimBERT: Tailoring BERT for Trade-offs (2202.12411v1)

Published 24 Feb 2022 in cs.CL

Abstract: Models based on BERT have been extremely successful in solving a variety of NLP tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning, which limits wider adoption. While self-attention layers have been well-studied, a strong justification for including the intermediate layers that follow them remains missing in the literature. In this work, we show that reducing the number of intermediate layers in BERT-Base results in minimal fine-tuning accuracy loss on downstream tasks while significantly decreasing model size and training time. We further mitigate two key bottlenecks by replacing all softmax operations in the self-attention layers with a computationally simpler alternative and removing half of all layernorm operations. This further decreases the training time while maintaining a high level of fine-tuning accuracy.
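The three modifications described in the abstract can be pictured with a minimal encoder-block sketch: an optional intermediate (feed-forward) sub-layer, a softmax substitute in self-attention, and a single layernorm instead of the usual two. This is an illustrative assumption, not the authors' implementation; in particular, the abstract does not name the "computationally simpler alternative" to softmax, so the ReLU-plus-normalization shown here is only a placeholder.

```python
# Hypothetical sketch of a "trimmed" BERT-style encoder block (PyTorch).
# Names and the softmax substitute are illustrative assumptions.
import torch
import torch.nn as nn


class TrimmedSelfAttention(nn.Module):
    """Self-attention where softmax is replaced by a cheaper normalization."""

    def __init__(self, hidden: int, heads: int):
        super().__init__()
        self.heads = heads
        self.head_dim = hidden // heads
        self.qkv = nn.Linear(hidden, 3 * hidden)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Placeholder softmax substitute: ReLU + row-wise sum normalization.
        # The abstract only says "a computationally simpler alternative".
        weights = torch.relu(scores)
        weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-6)
        ctx = (weights @ v).transpose(1, 2).reshape(b, t, h)
        return self.out(ctx)


class TrimmedEncoderBlock(nn.Module):
    """Encoder block with an optional intermediate (FFN) sub-layer
    and only one layernorm instead of the usual two."""

    def __init__(self, hidden: int = 768, heads: int = 12, keep_ffn: bool = False):
        super().__init__()
        self.attn = TrimmedSelfAttention(hidden, heads)
        self.norm = nn.LayerNorm(hidden)  # one of the two layernorms removed
        self.ffn = (
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            if keep_ffn else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x + self.attn(x))
        if self.ffn is not None:  # intermediate layer dropped when keep_ffn=False
            x = x + self.ffn(x)
        return x


if __name__ == "__main__":
    block = TrimmedEncoderBlock(keep_ffn=False)
    out = block(torch.randn(2, 16, 768))
    print(out.shape)  # torch.Size([2, 16, 768])
```

Dropping the FFN in a subset of blocks, as sketched above, is where most of the parameter and training-time savings would come from, since the intermediate layer accounts for roughly two-thirds of each block's weights in BERT-Base.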

Authors (3)
  1. Sharath Nittur Sridhar (16 papers)
  2. Anthony Sarah (10 papers)
  3. Sairam Sundaresan (17 papers)
Citations (4)