Just CHOP: Embarrassingly Simple LLM Compression (2305.14864v3)

Published 24 May 2023 in cs.CL

Abstract: LLMs enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint. A growing assortment of methods for compression promises to reduce the computational burden of LLMs in deployment, but so far, only quantization approaches have been demonstrated to be effective for LLM compression while maintaining zero-shot performance. A critical step in the compression process, the pretrain-then-finetune paradigm, has largely been overlooked when adapting existing pruning strategies to LLMs or proposing new ones. In this work, we show that embarrassingly simple layer pruning coupled with an extended LLM pretraining as the finetuning phase produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale while being more inference efficient. We call this method LayerChop, where we deterministically remove layers from a model followed by task-agnostic finetuning of the remaining weights by continued self-supervised pretraining. At this scale, we also show how distillation, which has been highly effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
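The sketch below illustrates the pruning step described in the abstract: deterministically dropping transformer layers and then continuing self-supervised pretraining on the remaining weights. It is a minimal illustration, not the paper's exact recipe; it assumes a Llama-style decoder-only model loaded with Hugging Face `transformers` (the model name, the `keep_every` schedule, and which layers are kept are all illustrative choices).

```python
# Minimal sketch of deterministic layer pruning ("chopping"), assuming a
# decoder-only HuggingFace model that exposes its blocks as model.model.layers.
# Which layers to keep is an illustrative choice, not the paper's exact scheme.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def chop_layers(model_name: str, keep_every: int = 2):
    """Drop transformer layers deterministically, keeping every `keep_every`-th block."""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    layers = model.model.layers  # nn.ModuleList of decoder blocks
    kept = [layer for i, layer in enumerate(layers) if i % keep_every == 0]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)  # keep the config consistent
    return model

# Hypothetical usage: halve the depth, then finetune the remaining weights with
# continued self-supervised (next-token) pretraining rather than task-specific data.
pruned = chop_layers("meta-llama/Llama-2-7b-hf", keep_every=2)
```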

Authors (6)
  1. Ananya Harsh Jha (8 papers)
  2. Tom Sherborne (15 papers)
  3. Evan Pete Walsh (3 papers)
  4. Dirk Groeneveld (19 papers)
  5. Emma Strubell (60 papers)
  6. Iz Beltagy (39 papers)
Citations (1)