Improving Word Embedding Factorization for Compression Using Distilled Nonlinear Neural Decomposition (1910.06720v2)

Published 2 Oct 2019 in cs.CL and cs.LG

Abstract: Word embeddings are vital components of NLP models and have been extensively explored. However, they consume a lot of memory, which poses a challenge for edge deployment. Embedding matrices typically contain most of the parameters of language models and about a third of the parameters of machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation. First, we initialize the weights of our decomposed matrices by learning to reconstruct the full pre-trained word embedding, and then fine-tune end-to-end, employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation and language modeling, using different datasets with the word-embedding matrix shared between the input embedding and the vocabulary projection. We show that the proposed technique is simple to replicate, with one fixed parameter controlling the compression rate, and achieves a higher BLEU score on translation and lower perplexity on language modeling than complex, difficult-to-tune state-of-the-art methods.
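As a rough illustration of the two-stage approach described in the abstract, the sketch below shows a low-rank factorized embedding layer whose factors are first fit to reconstruct a pre-trained embedding matrix (initialization stage) and can then be fine-tuned with a distillation term against a teacher's output logits. The module name `FactorizedEmbedding`, the plain MSE reconstruction loss, and the soft-target KL distillation term are illustrative assumptions, not the paper's exact formulation; the paper's decomposition is additionally nonlinear, which is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedEmbedding(nn.Module):
    """Low-rank stand-in for a |V| x d embedding matrix: E is approximated by A @ B.

    Hypothetical sketch only; the paper's method also uses a nonlinear,
    distilled decomposition that is omitted here.
    """
    def __init__(self, vocab_size: int, dim: int, rank: int):
        super().__init__()
        self.A = nn.Embedding(vocab_size, rank)    # |V| x r factor
        self.B = nn.Linear(rank, dim, bias=False)  # r x d factor

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up the low-rank code, then project to the full embedding dimension.
        return self.B(self.A(token_ids))

    def full_matrix(self) -> torch.Tensor:
        # Materialize the reconstructed |V| x d matrix (used for initialization).
        return self.B(self.A.weight)


def init_by_reconstruction(factored: FactorizedEmbedding,
                           pretrained: torch.Tensor,
                           steps: int = 1000, lr: float = 1e-3) -> None:
    """Stage 1 (assumed form): fit the factors to reconstruct the pre-trained matrix."""
    opt = torch.optim.Adam(factored.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(factored.full_matrix(), pretrained)
        loss.backward()
        opt.step()


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Stage 2 (assumed form): soft-target KD term added while fine-tuning end-to-end."""
    t = temperature
    return F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)
```

With |V|·r + r·d parameters instead of |V|·d, the rank r acts as the single knob controlling the compression rate, consistent with the abstract's claim of one fixed parameter controlling compression size.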

Authors (5)
  1. Vasileios Lioutas (16 papers)
  2. Ahmad Rashid (24 papers)
  3. Krtin Kumar (4 papers)
  4. Mehdi Rezagholizadeh (78 papers)
  5. Md Akmal Haidar (6 papers)
Citations (9)