Movement Pruning: Adaptive Sparsity by Fine-Tuning (2005.07683v2)

Published 15 May 2020 in cs.CL and cs.LG

Abstract: Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

Overview of "Movement Pruning: Adaptive Sparsity by Fine-Tuning"

Model sparsity is a pivotal factor in neural network efficiency, especially for deploying large pretrained models in resource-constrained environments. The paper "Movement Pruning: Adaptive Sparsity by Fine-Tuning" addresses the need for more adaptive pruning techniques in the transfer-learning setting that now dominates NLP. The authors introduce a method they term "movement pruning," which is particularly effective in high-sparsity regimes.

Movement pruning diverges from traditional magnitude pruning by scoring weights according to how they change during fine-tuning rather than by their absolute values, which are largely inherited from pre-training. This shift from a zeroth-order criterion to a first-order one lets the pruning decisions adapt to the specific fine-tuning task. Concretely, the method maintains an importance score for each weight, updated with gradients during training, and uses a straight-through estimator to circumvent the non-differentiability of the masking step.
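
To make the mechanism concrete, the sketch below shows one way such a layer could be implemented in PyTorch: a learned score per weight, a binary mask that keeps the top-scoring fraction of weights, and a straight-through estimator so the scores still receive gradients. The names TopKMask and MovementPrunedLinear and the score initialization are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """Binary mask keeping the top-scoring fraction of weights.

    The backward pass is a straight-through estimator: gradients flow to
    the scores as if the masking step were the identity.
    """

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient to the scores unchanged;
        # keep_ratio is not a tensor, so it receives no gradient.
        return grad_output, None


class MovementPrunedLinear(nn.Module):
    """Linear layer whose weights are masked by learned importance scores."""

    def __init__(self, in_features, out_features, keep_ratio=0.10):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.xavier_uniform_(self.weight)
        # One importance score per weight, trained jointly with the weights.
        # (Zero init is a simplification; real code may prefer a random init.)
        self.scores = nn.Parameter(torch.zeros_like(self.weight))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.keep_ratio)
        # Through the straight-through estimator, each score receives the
        # gradient dL/dW'_ij * W_ij, which accumulates the "movement" signal.
        return F.linear(x, self.weight * mask, self.bias)
```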

Methodology

The methodological novelty lies in how movement pruning scores each weight by how far it moves away from zero over the course of fine-tuning. The approach comes in two variants: hard movement pruning, which deterministically keeps the weights with the largest movement scores up to a target sparsity, and soft movement pruning, which keeps weights whose scores exceed a global threshold and adds a regularization term that controls the sparsity level reached during training.
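
As a rough sketch (again assuming PyTorch, with keep_ratio, tau, and lambda_reg as hypothetical hyperparameter names), the two variants differ mainly in how the mask is derived from the scores and in whether a sparsity penalty is added to the training loss:

```python
import torch


def hard_movement_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Hard variant: keep only the weights whose scores rank in the top keep_ratio."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).float()


def soft_movement_mask(scores: torch.Tensor, tau: float) -> torch.Tensor:
    """Soft variant: keep every weight whose score exceeds a global threshold."""
    return (scores > tau).float()


def soft_sparsity_penalty(all_scores, lambda_reg: float) -> torch.Tensor:
    """Regularizer added to the loss in the soft variant: it pushes scores
    down, so sparsity is steered by lambda_reg rather than fixed in advance."""
    return lambda_reg * sum(torch.sigmoid(s).sum() for s in all_scores)
```

In the hard variant the final sparsity is fixed in advance by keep_ratio; in the soft variant it emerges during training from the interaction between the threshold and the regularization strength.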

Experimental Results

The empirical validation of movement pruning demonstrates its effectiveness across a range of established NLP tasks including SQuAD, MNLI, and QQP. The results reveal that movement pruning not only surpasses magnitude pruning in high-sparsity scenarios but also compares favorably with existing advanced pruning techniques like L0 regularization.

For instance, soft movement pruning retains around 95% of the original model's accuracy while keeping only 5% of BERT-base's encoder weights. Furthermore, when coupled with distillation, movement pruning suffers even less degradation at extreme sparsity, highlighting its potential for practical compression without significant loss in task performance.
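
For illustration, a common way to combine pruning with distillation is to train the pruned student against both the hard labels and the softened predictions of an unpruned teacher. The sketch below shows such a combined loss in PyTorch; the temperature T and mixing weight alpha are assumed hyperparameters, not values taken from the paper.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the (unpruned) teacher, softened by temperature T.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard supervised loss on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```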

Implications and Future Directions

Movement pruning represents a significant advancement in model compression for transfer learning. It opens avenues for efficiently deploying state-of-the-art NLP models on edge devices, contributing to reduced energy consumption and enhanced privacy by eliminating the need for continual data communication with centralized servers.

From a theoretical standpoint, the paper provides grounds to reconsider weight pruning criteria, advocating for task-adapted pruning mechanisms rather than static pre-trained model reductions. Future developments could explore synergies between movement pruning and structured pruning techniques to enhance both interpretability and computational efficiency. Additionally, examining the integration of hardware-specific optimizations for these highly sparse models could broaden the application scope of movement pruning within industry and research sectors alike.

Authors (3)
  1. Victor Sanh (21 papers)
  2. Thomas Wolf (117 papers)
  3. Alexander M. Rush (115 papers)
Citations (427)