Training Neural Networks with Fixed Sparse Masks

Published 18 Nov 2021 in cs.LG | (2111.09839v1)

Abstract: During typical gradient-based training of deep neural networks, all of the model's parameters are updated at each iteration. Recent work has shown that it is possible to update only a small subset of the model's parameters during training, which can alleviate storage and communication requirements. In this paper, we show that it is possible to induce a fixed sparse mask on the model's parameters that selects a subset to update over many iterations. Our method constructs the mask out of the $k$ parameters with the largest Fisher information as a simple approximation as to which parameters are most important for the task at hand. In experiments on parameter-efficient transfer learning and distributed training, we show that our approach matches or exceeds the performance of other methods for training with sparse updates while being more efficient in terms of memory usage and communication costs. We release our code publicly to promote further applications of our approach.

Authors (3)

Citations (166)

Summary

The paper introduces the FISH mask, a novel approach that selects key parameters based on Fisher information for efficient updates.
It achieves comparable performance to full-model training on benchmarks like GLUE while updating only 0.5% of parameters.
The method significantly reduces memory use, communication costs, and checkpoint sizes in distributed and resource-constrained environments.

This paper explores training deep neural networks with a fixed sparse mask, so that only a selected subset of the model's parameters is updated over many iterations. The motivation is to reduce the storage and communication costs of gradient-based training, which normally updates every parameter at every step. The authors propose an efficient method for identifying the most important parameters, as measured by their Fisher information, and demonstrate its effectiveness in parameter-efficient transfer learning and distributed training.

Methodology

In a traditional setting, stochastic gradient descent (SGD) iteratively updates all parameters of a deep neural network. This approach, while effective, incurs significant costs in terms of communication and memory, especially as models scale to hundreds of millions of parameters. The research introduces a solution through the development of fixed sparse masks, termed the FISH (Fisher-Induced Sparse uncHanging) mask. This mask is pre-computed by selecting parameters with the largest Fisher information, indicating their relative importance for task-specific outputs.
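The difference between a full update and a masked update can be shown with a minimal sketch. The function name `masked_sgd_step`, the toy values, and the learning rate are illustrative assumptions, not taken from the paper's released code.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    """Apply one SGD step, changing only the entries where mask is 1."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


# Toy values (assumed for illustration only).
params = [0.5, -1.0, 2.0, 0.0]
grads = [0.2, 0.4, -0.6, 1.0]
mask = [1, 0, 0, 1]  # chosen ahead of time and reused at every step

updated = masked_sgd_step(params, grads, mask)
# Entries 1 and 2 keep their original values; entries 0 and 3 take a gradient step.
```

Because the mask stays the same from step to step, the set of parameters that ever differs from the starting point is known in advance.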

The mask is computed ahead of time and then held fixed, rather than being recomputed at every iteration. The selection criterion is a diagonal approximation of the Fisher information matrix: for each parameter, the squared gradient of the model's log-likelihood is averaged over a set of training samples, which requires only forward and backward passes. The $k$ parameters with the largest values form the mask, and only those parameters are updated afterwards.
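As a rough illustration of this selection step, the sketch below estimates a per-parameter Fisher score for a toy logistic regression model and keeps the top $k$ entries. The model, the data, and the helper names `diagonal_fisher` and `top_k_mask` are assumptions made for the example; the paper applies the idea to large networks, where the gradients come from backpropagation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(weights, inputs):
    """Per-parameter Fisher estimate for logistic regression.

    For an input x, the gradient of log p(y | x) with respect to w_j is
    (y - p) * x_j, where p = sigmoid(w . x). The expectation of its square
    over y ~ Bernoulli(p) is p * (1 - p) * x_j ** 2.
    """
    fisher = [0.0] * len(weights)
    for x in inputs:
        p = sigmoid(sum(w * xj for w, xj in zip(weights, x)))
        for j, xj in enumerate(x):
            fisher[j] += p * (1.0 - p) * xj * xj
    return [f / len(inputs) for f in fisher]


def top_k_mask(scores, k):
    """Return a 0/1 mask with ones at the k highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = set(order[:k])
    return [1 if j in keep else 0 for j in range(len(scores))]


# Toy data (assumed): feature 2 has the largest magnitude, feature 3 is always zero.
weights = [0.0, 0.0, 0.0, 0.0]
inputs = [
    [1.0, 0.1, 3.0, 0.0],
    [2.0, 0.2, -3.0, 0.0],
    [1.5, -0.1, 2.5, 0.0],
]

fisher = diagonal_fisher(weights, inputs)
mask = top_k_mask(fisher, k=2)
```

Here the two parameters attached to the largest-magnitude features receive the highest scores, so the mask selects positions 0 and 2. A parameter whose input is always zero gets a score of zero and is never selected.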

Experimental Results

Several experiments evaluate the effectiveness of FISH Masks. The findings indicate that the method matches or exceeds the performance of other sparse-update training methods while using less memory and communication. In particular, the authors demonstrate:

Parameter-Efficient Transfer Learning:
- When FISH Masks are used to fine-tune BERT$_{\text{LARGE}}$ on the GLUE benchmark, results are comparable to full-model fine-tuning while updating only 0.5% of the parameters. The FISH Mask outperforms random-mask baselines and is competitive with methods such as Diff Pruning and BitFit.
Distributed Training:
- In scenarios involving distributed training where communication costs are critical, FISH Masks facilitate significant reductions in necessary communication bandwidth by leveraging sparse updates. Experiments on a ResNet-34 model trained on CIFAR-10 demonstrated that performance was maintained with fewer communication steps compared to standard procedures.
Efficient Checkpointing:
- FISH Masks prove advantageous in reducing checkpoint sizes during training. By only saving the updated parameters and their indices, substantial disk space savings are realized, which can be particularly useful in environments with restricted storage capacity.
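The storage saving described in the last item can be sketched as follows. The helper names `sparse_checkpoint` and `restore` are hypothetical, and the size estimate relies on assumed numbers (about 340 million parameters, 4-byte values, 4-byte indices) rather than figures reported in the paper.

```python
def sparse_checkpoint(params, mask):
    """Keep only (index, value) pairs for parameters the mask allows to change."""
    return [(i, p) for i, (p, m) in enumerate(zip(params, mask)) if m]


def restore(base_params, checkpoint):
    """Rebuild the full parameter list from shared base weights plus a sparse checkpoint."""
    restored = list(base_params)
    for i, value in checkpoint:
        restored[i] = value
    return restored


base = [0.5, -1.0, 2.0, 0.0]     # e.g. pre-trained weights, stored once
tuned = [0.48, -1.0, 2.0, -0.1]  # after masked training
mask = [1, 0, 0, 1]

ckpt = sparse_checkpoint(tuned, mask)
roundtrip = restore(base, ckpt)

# Rough size estimate under the assumed numbers stated above.
n_params = 340_000_000
n_updated = n_params * 5 // 1000       # 0.5% of the parameters
full_mb = n_params * 4 / 1e6           # dense float32 checkpoint
sparse_mb = n_updated * (4 + 4) / 1e6  # value plus index per updated entry
```

Under these assumptions the sparse checkpoint is about 1% of the dense one, even after paying for the indices. The same (index, value) representation could plausibly serve as the payload exchanged between workers in the distributed setting.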

The experiments also indicate that the FISH Mask is robust to the number of samples used to estimate the Fisher information: even a small number of samples is enough to derive an effective sparse mask.

Implications and Future Directions

The proposed method balances computational efficiency with model performance, which is especially useful in distributed training and in environments with limited bandwidth. Its implications are both practical, in reducing the hardware resources that training requires, and theoretical, in what it suggests about which parameters matter during training.

Future work could refine the parameter-selection strategy or incorporate other measures of parameter importance. The approach also looks promising for federated learning, where it could reduce communication while respecting the privacy constraints inherent in such collaborative settings. Finally, the results invite further study of the connection between sparse training and network compression, including ideas such as the Lottery Ticket Hypothesis.