Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning (2309.08708v1)

Published 15 Sep 2023 in cs.CL

Abstract: The extensive memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings, such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, forming a large proportion of the model parameters. While previous work towards parameter-efficient PLM development has considered pruning parameters within the transformer layers, pruning the embedding matrix as part of fine-tuning or inference has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused in these scenarios. We then propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix. We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach maintains equivalent downstream task performance while allowing a more efficient use of compute resources.
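At a high level, the approach exploits the observation that a downstream corpus touches only a subset of the vocabulary, so the unused rows of the embedding matrix can be dropped and the remaining token IDs remapped to a compact index space. The sketch below illustrates this idea for a PyTorch `nn.Embedding`; the function names and the ID-remapping details are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: keep embedding rows for token IDs that actually
# occur in a given corpus, then remap inputs to the compact vocabulary.
# Names and steps are assumptions, not the paper's exact procedure.
import torch
import torch.nn as nn


def prune_embedding(embedding: nn.Embedding, used_token_ids: torch.Tensor):
    """Keep only rows of `embedding` whose IDs occur in `used_token_ids`.

    Returns the pruned embedding and a lookup table mapping original
    token IDs to their new (compact) indices.
    """
    used_ids = torch.unique(used_token_ids)
    pruned = nn.Embedding(
        num_embeddings=used_ids.numel(),
        embedding_dim=embedding.embedding_dim,
    )
    with torch.no_grad():
        pruned.weight.copy_(embedding.weight[used_ids])

    # Old ID -> new ID; unused IDs map to -1 and must never be looked up.
    id_map = torch.full((embedding.num_embeddings,), -1, dtype=torch.long)
    id_map[used_ids] = torch.arange(used_ids.numel())
    return pruned, id_map


# Usage: collect the token IDs seen in the fine-tuning/inference data,
# prune the input embedding matrix, and remap batches before the forward pass.
if __name__ == "__main__":
    vocab_size, dim = 30_000, 768
    full_embedding = nn.Embedding(vocab_size, dim)

    corpus_ids = torch.randint(0, 5_000, (10_000,))    # toy "used" vocabulary
    pruned_embedding, id_map = prune_embedding(full_embedding, corpus_ids)

    batch = corpus_ids[:6].reshape(2, 3)               # token IDs from the corpus
    compact_batch = id_map[batch]                      # remap to pruned indices
    hidden = pruned_embedding(compact_batch)
    print(full_embedding.weight.shape, "->", pruned_embedding.weight.shape)
    print(hidden.shape)
```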

Authors (2)
  1. Miles Williams (5 papers)
  2. Nikolaos Aletras (72 papers)
Citations (2)
