Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers (2211.11586v1)

Published 17 Nov 2022 in cs.CL and cs.LG

Abstract: Large-scale transformer models have become the de facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. Particularly, random-LTD achieves considerable speedups and comparable accuracy to the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance score-based metrics, (2) any special token treatment (e.g., [CLS]), and (3) many layers in full sequence length training except the first and the last layers. Besides, a new LayerToken learning rate schedule is proposed for pretraining problems that resolves the heavy tuning requirement for our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% theoretical compute cost and 25.6% wall-clock training time while achieving similar zero-shot evaluations on GPT-3 1.3B as compared to the baseline.
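The abstract only sketches the mechanism at a high level: every middle layer processes a random subset of tokens while the first and last layers see the full sequence, and the dropped tokens bypass the layer unchanged. Below is a minimal, illustrative PyTorch sketch of that idea, assuming a generic stack of transformer encoder layers. The class name `RandomLTDStack`, the fixed `keep_ratio` parameter, and the bypass-and-reinsert logic are assumptions for illustration, not the authors' implementation; in particular, the paper schedules the number of kept tokens over training rather than fixing it.

```python
# Illustrative sketch of random layerwise token dropping (random-LTD).
# Assumption: a plain stack of nn.TransformerEncoderLayer modules and a
# fixed keep ratio; the paper's actual training schedule is more elaborate.
import torch
import torch.nn as nn


class RandomLTDStack(nn.Module):
    def __init__(self, layers: nn.ModuleList, keep_ratio: float = 0.5):
        super().__init__()
        self.layers = layers          # transformer layers, applied in order
        self.keep_ratio = keep_ratio  # fraction of tokens kept in middle layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        x = self.layers[0](x)  # first layer sees the full sequence

        seq_len = x.size(1)
        n_keep = max(1, int(seq_len * self.keep_ratio))
        for layer in self.layers[1:-1]:
            # Each middle layer draws its own random subset of token positions.
            keep_idx = torch.randperm(seq_len, device=x.device)[:n_keep]
            kept = layer(x[:, keep_idx, :])   # only kept tokens go through the layer
            x = x.clone()
            x[:, keep_idx, :] = kept          # dropped tokens bypass the layer unchanged

        return self.layers[-1](x)             # last layer sees the full sequence again


# Toy usage (shapes only, not a training recipe):
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(6)
)
model = RandomLTDStack(layers, keep_ratio=0.5)
out = model(torch.randn(2, 128, 64))  # (batch=2, seq_len=128, hidden=64)
```

With a keep ratio of 0.5, each middle layer runs attention and MLP over half the tokens, which is where the roughly one-third theoretical compute saving reported in the abstract comes from once the full-sequence first and last layers are accounted for.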

Authors (7)
  1. Zhewei Yao (64 papers)
  2. Xiaoxia Wu (30 papers)
  3. Conglong Li (15 papers)
  4. Connor Holmes (20 papers)
  5. Minjia Zhang (54 papers)
  6. Cheng Li (1094 papers)
  7. Yuxiong He (59 papers)
Citations (11)
