A Short Study on Compressing Decoder-Based Language Models (2110.08460v1)
Abstract: Pre-trained language models (PLMs) have been successful on a wide range of NLP tasks. State-of-the-art PLMs, however, are too large to be deployed on edge devices. As a result, model compression has attracted increasing attention in the NLP community. Most existing work focuses on compressing encoder-based models (TinyBERT, DistilBERT, DistilRoBERTa, etc.); to the best of our knowledge, however, the compression of decoder-based models (such as GPT-2) has not been investigated much. Our paper aims to fill this gap. Specifically, we explore two directions: 1) we employ current state-of-the-art knowledge distillation techniques to improve the fine-tuning of DistilGPT-2, and 2) we pre-train a compressed GPT-2 model using layer truncation and compare it against the distillation-based method (DistilGPT-2). The training time of our compressed model is significantly less than that of DistilGPT-2, yet it achieves better performance when fine-tuned on downstream tasks. We also demonstrate the impact of data cleaning on model performance.
- Tianda Li (10 papers)
- Yassir El Mesbahi (5 papers)
- Ivan Kobyzev (23 papers)
- Ahmad Rashid (24 papers)
- Atif Mahmud (1 paper)
- Nithin Anchuri (2 papers)
- Habib Hajimolahoseini (10 papers)
- Yang Liu (2253 papers)
- Mehdi Rezagholizadeh (78 papers)
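The layer-truncation direction described in the abstract can be illustrated with a minimal sketch using the Hugging Face transformers API. This is an assumption-laden illustration rather than the authors' released code: keeping the first 6 of GPT-2's 12 layers is an arbitrary illustrative choice, and the paper's exact layer-selection and pre-training setup may differ.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Load the full GPT-2 model (12 transformer blocks) as the starting point.
full_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Build a truncated configuration with fewer layers (6 here is an assumed,
# illustrative choice, not necessarily the paper's setting).
truncated_config = GPT2Config.from_pretrained("gpt2", n_layer=6)
truncated_model = GPT2LMHeadModel(truncated_config)

# Copy the token embeddings, position embeddings, the first n_layer
# transformer blocks, and the final layer norm from the full model.
truncated_model.transformer.wte.load_state_dict(full_model.transformer.wte.state_dict())
truncated_model.transformer.wpe.load_state_dict(full_model.transformer.wpe.state_dict())
for i in range(truncated_config.n_layer):
    truncated_model.transformer.h[i].load_state_dict(full_model.transformer.h[i].state_dict())
truncated_model.transformer.ln_f.load_state_dict(full_model.transformer.ln_f.state_dict())

# The truncated model would then be further pre-trained on a language-modeling
# corpus before fine-tuning on downstream tasks, as the abstract describes.
print(truncated_model.config.n_layer)  # 6
```

In this sketch the language-modeling head needs no separate copy because GPT-2 ties it to the token embedding matrix; the compute savings over distillation come from skipping the teacher forward passes during pre-training.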