Kronecker Decomposition for GPT Compression (2110.08152v1)

Published 15 Oct 2021 in cs.CL

Abstract: GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the NLP domain due to its state-of-the-art performance on several downstream tasks. The success of GPT is mostly attributed to its pre-training on huge amounts of data and its large number of parameters (from ~100M to billions of parameters). Despite the superior performance of GPT (especially in few-shot or zero-shot setups), its overparameterized nature can be prohibitive for deploying the model on devices with limited computational power or memory. This problem can be mitigated with model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the Kronecker-decomposed version of GPT-2 and then undergoes very light pre-training on only a small portion of the training data with intermediate-layer knowledge distillation (ILKD). Finally, KnGPT2 is fine-tuned on downstream tasks, also using ILKD. We evaluate our model on both language modeling and the General Language Understanding Evaluation (GLUE) benchmark tasks and show that, with more efficient pre-training and a similar number of parameters, our KnGPT2 significantly outperforms the existing DistilGPT2 model.
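
The abstract describes replacing GPT-2's linear mappings with Kronecker-factored weights initialized from a decomposition of the original matrices. As a minimal sketch (not the paper's implementation), the snippet below shows the standard nearest-Kronecker-product initialization via Van Loan's rearrangement plus a rank-1 SVD, and a factored matrix multiply that never materializes the full weight. The factor shapes and function names are illustrative assumptions.

```python
import numpy as np

def nearest_kronecker(W, a_shape, b_shape):
    """Best A (a_shape) and B (b_shape) such that A ⊗ B ≈ W in Frobenius norm
    (Van Loan's rearrangement + rank-1 SVD). Shapes must satisfy
    W.shape == (a_shape[0] * b_shape[0], a_shape[1] * b_shape[1])."""
    m1, n1 = a_shape
    m2, n2 = b_shape
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that each (m2 x n2) block becomes one row; the best
    # rank-1 approximation of this matrix yields vec(A) and vec(B).
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

def kron_linear(x, A, B):
    """Compute x @ (A ⊗ B).T without forming the full (m1*m2, n1*n2) matrix.
    x: (batch, n1*n2)  ->  output: (batch, m1*m2)."""
    m1, n1 = A.shape
    m2, n2 = B.shape
    xr = x.reshape(-1, n1, n2)                       # (batch, n1, n2)
    y = np.einsum('bjq,ij,pq->bip', xr, A, B)        # (batch, m1, m2)
    return y.reshape(x.shape[0], m1 * m2)

# Hypothetical usage: factor a GPT-2 feed-forward weight of shape (3072, 768).
rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))
A, B = nearest_kronecker(W, (64, 32), (48, 24))      # 64*48 = 3072, 32*24 = 768
x = rng.standard_normal((4, 768))
print(np.allclose(x @ np.kron(A, B).T, kron_linear(x, A, B)))  # True
```

With these (assumed) factor shapes, the layer stores 64*32 + 48*24 parameters instead of 3072*768, which is the kind of reduction the Kronecker factorization targets; the subsequent light pre-training and ILKD described in the abstract are what recover the accuracy lost by the approximation.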

Authors (6)
  1. Ali Edalati
  2. Marzieh Tahaei
  3. Ahmad Rashid
  4. Vahid Partovi Nia
  5. James J. Clark
  6. Mehdi Rezagholizadeh
Citations (29)