Structured Pruning Learns Compact and Accurate Models (2204.00408v3)

Published 1 Apr 2022 in cs.CL and cs.LG

Abstract: The growing size of neural LLMs has led to increased attention in model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce the model size but hardly achieve large speedups as distillation. However, distillation methods require large amounts of unlabeled data and are expensive to train. In this work, we propose a task-specific structured pruning method CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches the distillation methods in both accuracy and latency, without resorting to any unlabeled data. Our key insight is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads and hidden units) modules, which controls the pruning decision of each parameter with masks of different granularity. We also devise a layerwise distillation strategy to transfer knowledge from unpruned to pruned models during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop, showing its effectiveness and efficiency compared to previous pruning and distillation approaches.

An Expert Overview of "Structured Pruning Learns Compact and Accurate Models"

The paper "Structured Pruning Learns Compact and Accurate Models" by Mengzhou Xia, Zexuan Zhong, and Danqi Chen from Princeton University presents a focused investigation into model compression by structured pruning of neural LLMs. The fundamental challenge addressed is balancing the trade-offs between model size reduction, accuracy retention, and computational efficiency. The paper introduces a pruning approach named CoFi (Coarse- and Fine-grained Pruning), which provides a nuanced solution that claims to be competitive with existing distillation methods, offering substantial inference speedups without the need for large volumes of unlabeled data.

Methodology

The paper proposes the CoFi method, which performs task-specific structured pruning across different levels of granularity, encompassing coarse-grained elements (such as entire layers in a network) and fine-grained components (such as individual attention heads and hidden units). The innovation lies in the joint application of coarse- and fine-grained pruning, controlled by distinct mask variables of different granularity that together dictate which model parameters are pruned, as sketched below.
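To make the masking scheme concrete, the following is a minimal sketch (our illustration, not the released CoFi code) of how a coarse sublayer mask and fine-grained head, intermediate-unit, and hidden-dimension masks might jointly gate one transformer layer. The names `z_mha`, `z_head`, `z_int`, and `z_hidden` are illustrative; in the paper these mask variables are learned during training rather than set by hand.

```python
# Sketch: joint coarse- and fine-grained masking of one transformer layer.
# Assumption: masks are plain tensors here; CoFi learns them jointly with
# the model weights instead of fixing them manually.
import torch

hidden_size, num_heads, head_dim, intermediate_size = 768, 12, 64, 3072

# Coarse-grained masks: drop an entire attention or feed-forward sublayer.
z_mha = torch.tensor(1.0)   # 0.0 would remove the whole attention sublayer
z_ffn = torch.tensor(1.0)   # 0.0 would remove the whole feed-forward sublayer

# Fine-grained masks: individual heads, FFN intermediate units, hidden dims.
z_head = torch.ones(num_heads)          # per attention head
z_int = torch.ones(intermediate_size)   # per FFN intermediate unit
z_hidden = torch.ones(hidden_size)      # per hidden dimension (shared)

def masked_attention_output(attn_heads):
    """attn_heads: (batch, num_heads, seq_len, head_dim) per-head outputs.
    A head survives only if its own mask AND the coarse MHA mask are on."""
    gate = z_mha * z_head.view(1, num_heads, 1, 1)
    return attn_heads * gate

def masked_ffn(x, W_in, W_out):
    """x: (batch, seq_len, hidden). Intermediate units and the whole FFN
    sublayer are gated; hidden-dimension masks gate the final output."""
    h = torch.relu(x @ W_in) * z_int   # prune individual intermediate units
    out = (h @ W_out) * z_ffn          # prune the entire sublayer
    return out * z_hidden              # prune shared hidden dimensions

# Example: gating random per-head attention outputs and an FFN input.
attn = torch.randn(2, num_heads, 16, head_dim)
x = torch.randn(2, 16, hidden_size)
W_in = torch.randn(hidden_size, intermediate_size)
W_out = torch.randn(intermediate_size, hidden_size)
print(masked_attention_output(attn).shape, masked_ffn(x, W_in, W_out).shape)
```

Because a parameter is removed only when every mask covering it is zeroed, pruning an entire layer and pruning a single head within that layer are expressed in the same multiplicative form, which is what lets both granularities be optimized jointly.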

Additionally, CoFi employs a layerwise distillation strategy that transfers knowledge from the unpruned (teacher) model to the pruned (student) model during training. Unlike traditional distillation approaches, which rely on pre-defined architectural decisions for the student, CoFi dynamically adjusts the layer mappings throughout training; a sketch of this matching step follows. This adaptive mechanism helps achieve high compression rates while maintaining accuracy.
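The sketch below is our illustration of dynamic layer matching, not the paper's exact formulation: each selected teacher layer is paired with whichever student layer's hidden states are currently closest under a simple MSE distance, instead of a fixed hand-picked mapping. The linear projection `proj` and the specific distance are assumptions made for the sketch.

```python
# Sketch: layerwise distillation with a dynamically chosen layer mapping.
import torch
import torch.nn.functional as F

def layerwise_distill_loss(teacher_hiddens, student_hiddens, proj):
    """teacher_hiddens, student_hiddens: lists of (batch, seq, hidden) tensors.
    proj: nn.Linear mapping the student hidden size to the teacher hidden size."""
    loss = 0.0
    for t_h in teacher_hiddens:
        # For this teacher layer, pick the best-matching student layer
        # at the current point in training rather than a fixed partner.
        dists = [F.mse_loss(proj(s_h), t_h) for s_h in student_hiddens]
        loss = loss + torch.stack(dists).min()
    return loss / len(teacher_hiddens)

# Toy usage with random hidden states of matching shapes.
teacher_hiddens = [torch.randn(2, 16, 768) for _ in range(4)]
student_hiddens = [torch.randn(2, 16, 768) for _ in range(4)]
proj = torch.nn.Linear(768, 768)
print(layerwise_distill_loss(teacher_hiddens, student_hiddens, proj))
```

Re-evaluating the mapping during optimization matters because the student's architecture keeps changing as layers and heads are pruned away, so a mapping fixed at the start would quickly become stale.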

Experimental Findings

The empirical evaluation on benchmark datasets, including GLUE and SQuAD, demonstrates CoFi's capabilities. The paper reports that CoFi achieves model sparsities of over 95% while preserving more than 90% of the original model's accuracy, along with inference speedups exceeding 10×. In direct comparisons with state-of-the-art distillation methods such as TinyBERT and MobileBERT, CoFi models performed comparably or better without requiring additional unlabeled data or extensive training, showing a balance of accuracy and computational efficiency.

Implications and Future Directions

The implications of this work are multifaceted, addressing both practical applications and theoretical explorations within model compression. Practically, CoFi presents a viable method for deploying compact, efficient models in resource-constrained environments, meeting the growing demand for real-time and embedded AI applications. Theoretically, CoFi sheds light on the potential of structured pruning as a powerful alternative to traditional model distillation, especially where the large unlabeled corpora and training budgets that distillation demands are unavailable.

Future investigations might focus on extending CoFi's structured pruning to pre-training phases, possibly generating task-agnostic models with enhanced flexibility and efficiency. Additionally, adapting CoFi techniques to other architectures, such as hierarchical transformers or those tailored for specific domains like vision or speech, could yield beneficial insights and applications.

In sum, "Structured Pruning Learns Compact and Accurate Models" contributes valuable advancements to the field of model compression, emphasizing the potency of structured pruning within neural networks and providing a compelling alternative to data-intensive distillation methods.

Authors (3)
  1. Mengzhou Xia (34 papers)
  2. Zexuan Zhong (17 papers)
  3. Danqi Chen (84 papers)
Citations (159)