Introduction
The efficiency of deploying large-scale neural networks depends heavily on model compression techniques. Among these, N:M structured sparsity stands out for its favorable balance between model quality and reductions in memory and compute requirements. However, as sparsity levels increase, preserving model quality becomes difficult. This paper addresses the quality degradation seen in high-sparsity regimes, identifying noise injected into gradient magnitudes as a key shortcoming of current sparse-training methods. The authors introduce decay mechanisms during training to mitigate this noise, significantly improving model quality for both vision models and large language models (LLMs).
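To make the N:M pattern concrete: an N:M-sparse weight matrix keeps only the N largest-magnitude weights within every contiguous group of M. The minimal PyTorch sketch below (an illustration, not the authors' code; the function name and shapes are assumptions) computes such a mask.

```python
import torch

def nm_sparsity_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every contiguous group of m.

    Illustrative sketch (not the paper's implementation). Returns a binary
    mask with the same shape as `weight`.
    """
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"
    groups = weight.abs().reshape(-1, m)       # (num_groups, m)
    topk = groups.topk(n, dim=1).indices       # indices of the n survivors per group
    mask = torch.zeros_like(groups)
    mask.scatter_(1, topk, 1.0)
    return mask.reshape(out_features, in_features)

# Example: a 2:4 pattern keeps half the weights; a 1:32 pattern (~97% pruned) keeps 1/32.
w = torch.randn(8, 32)
mask = nm_sparsity_mask(w, n=1, m=32)
print(mask.sum().item() / mask.numel())        # ~0.03
```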
Gradual Noise Reduction Through Decay Mechanisms
This work's core proposition is a notable relationship between gradient noise and sparsity level: at higher sparsity, gradient noise is more pronounced and degrades model quality. The paper puts forth a decay-based training regimen that progressively restricts gradient flow toward pruned weights and, unlike previous methods, permits some gradient flow to pruned elements during the critical early phases of training. This strategy yields marked improvements in model quality, with documented gains of up to 2% for vision models and 5% for language models when the majority (~97%) of parameters are pruned.
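The gist of the decay idea can be conveyed with a short, hedged sketch: during backpropagation, pruned positions receive a fraction of their gradient that starts near 1 and decays toward 0 as training progresses. The class name, the linear schedule, and the exact scaling below are assumptions made for illustration, not the paper's formulation.

```python
import torch

class DecayedMaskedWeight(torch.autograd.Function):
    """Forward: apply the N:M mask to the weight.
    Backward: let a decaying fraction of the gradient reach pruned weights.

    Illustrative sketch of the general 'decaying gradient flow' idea only.
    """

    @staticmethod
    def forward(ctx, weight, mask, decay_factor):
        ctx.save_for_backward(mask)
        ctx.decay_factor = decay_factor
        return weight * mask

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Kept weights receive the full gradient; pruned weights receive a
        # gradually shrinking fraction of it (full flow early, none late).
        grad_weight = grad_output * (mask + ctx.decay_factor * (1.0 - mask))
        return grad_weight, None, None


def decay_factor(step: int, total_steps: int) -> float:
    """Linear schedule from 1.0 (start of training) to 0.0 (end).
    The schedule shape itself is an assumption."""
    return max(0.0, 1.0 - step / total_steps)
```

In a training loop, one would apply this to the weight before the matrix multiply, e.g. `torch.nn.functional.linear(x, DecayedMaskedWeight.apply(w, mask, decay_factor(step, total_steps)))`, recomputing the mask on whatever schedule the recipe prescribes.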
Computational Efficiency in Training
A significant concern when employing sparse models is the compute cost incurred during training. The paper examines this by plotting model accuracy against training compute cost, denominated in FLOPs. At iso-training FLOPs, the proposed method comes out ahead, delivering up to a 2% increase in model accuracy; it can also match the quality of current state-of-the-art structured sparse training recipes while requiring over 30% fewer training FLOPs.
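For readers unfamiliar with how such comparisons are denominated, the back-of-envelope sketch below estimates per-layer matrix-multiply FLOPs as a function of the N:M density. It is a simplification (it assumes sparse kernels realize the full n/m reduction and ignores overheads), not the paper's accounting.

```python
def linear_matmul_flops(batch: int, in_features: int, out_features: int,
                        n: int = 4, m: int = 4) -> int:
    """Multiply-accumulate FLOPs for one linear layer, assuming an N:M-sparse
    weight matrix pays only for its surviving n/m fraction of weights.
    Dense corresponds to n == m. Back-of-envelope estimate only."""
    dense_flops = 2 * batch * in_features * out_features
    return int(dense_flops * n / m)

# Example: a 1:16 pattern keeps 1/16 of the weights, so the matmul costs
# ~6% of its dense FLOPs under these assumptions.
dense = linear_matmul_flops(1024, 4096, 4096)
sparse = linear_matmul_flops(1024, 4096, 4096, n=1, m=16)
print(sparse / dense)   # 0.0625
```

Total training FLOPs then scale with this per-step cost times the number of training steps, which is the quantity the paper's accuracy-versus-FLOPs comparison is built on.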
Analysis and Experimental Validation
A methodical empirical analysis underpins the proposed training recipes. The authors present an extensive experimental section in which they evaluate their methods on multiple attention-based models across a variety of tasks. On scenarios such as image classification and language understanding, the decay-based approaches consistently outperform the baselines across different N:M sparsity patterns. Most strikingly, the authors report roughly a 2% to 5% boost in performance at high sparsity ratios, a robust result that substantiates the practicality of their approach.
Furthermore, the paper extends its analysis to other sparsity approaches and architectures, including CNNs, illustrating the broad applicability of the decay-based training methods.
Concluding Remarks
This research rests on the idea that fine-grained control over gradient flow during training can yield high-quality models even at aggressive levels of sparsity. It shows that while sparsity is a double-edged sword, with the right training recipe it is possible to reap the benefits of model compression without substantive loss in performance. The promise shown by these decay-based training methods could set a new standard for training highly efficient yet accurate sparse neural networks. With the source code available on GitHub, the contributions are well positioned to have immediate and widespread impact in the field.