Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models (2306.10460v1)
Abstract: Large pre-trained transformers have received explosive attention in the past few years owing to their wide adaptability to numerous downstream applications via fine-tuning, but their rapidly growing parameter counts make even fine-tuning them difficult without industry-grade hardware. Recently, the Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune these large pre-trained models, producing subnetworks that match the performance of their dense counterparts; however, the practicality of LTH is severely limited by the repetitive full-training-and-pruning routine of iterative magnitude pruning (IMP), which only worsens as model size grows. Motivated by recent observations on model soups, which suggest that the fine-tuned weights of multiple models can be merged into a better minimum, we propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks at a fraction of the original IMP cost, replacing the expensive intermediate pruning stages of IMP with a computationally efficient weak-mask generation and aggregation routine. More specifically, during the mask generation stage, ISP runs a small handful of training iterations under varying training protocols and data subsets to produce many weak and noisy subnetworks, then superposes them to average out the noise, yielding a single high-quality denoised subnetwork. Extensive experiments and ablations on two popular large-scale pre-trained models, CLIP (previously unexplored in the pruning literature) and BERT, across multiple benchmark vision and language datasets validate the effectiveness of ISP against several state-of-the-art pruning methods. Code is available at: https://github.com/VITA-Group/instant_soup
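The weak-mask generation and aggregation routine described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation (see the linked repository for that): it assumes magnitude-based mask scoring and a simple majority vote as the "superposition" step, and the helper names (`magnitude_mask`, `weak_finetune`, `instant_soup_mask`) are purely illustrative.

```python
# Minimal sketch of the weak-mask generation/aggregation idea described above.
# NOT the authors' code: magnitude scoring, short SGD runs on random data
# subsets, and majority-vote mask averaging are illustrative assumptions.
import copy
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset, TensorDataset


def magnitude_mask(model, sparsity):
    """Binary mask keeping the largest-magnitude weights of each weight matrix."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:  # skip biases / norm parameters
            continue
        keep = int(p.numel() * (1 - sparsity))
        # threshold = smallest magnitude among the `keep` largest entries
        thresh = p.detach().abs().flatten().kthvalue(p.numel() - keep + 1).values
        masks[name] = (p.detach().abs() >= thresh).float()
    return masks


def weak_finetune(model, loader, steps=20, lr=1e-3):
    """A handful of cheap fine-tuning steps: one 'weak' training protocol."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model


def instant_soup_mask(base_model, dataset, sparsity=0.5, n_weak=4):
    """Generate several noisy masks from short runs on random data subsets,
    then superpose (average) them and re-threshold into one denoised mask."""
    votes = None
    for _ in range(n_weak):
        idx = random.sample(range(len(dataset)), k=len(dataset) // 2)
        loader = DataLoader(Subset(dataset, idx), batch_size=32, shuffle=True)
        weak = magnitude_mask(weak_finetune(copy.deepcopy(base_model), loader), sparsity)
        votes = weak if votes is None else {k: votes[k] + weak[k] for k in votes}
    # keep weights that survived in at least half of the weak masks
    return {k: (v / n_weak >= 0.5).float() for k, v in votes.items()}


if __name__ == "__main__":
    torch.manual_seed(0)
    X, y = torch.randn(512, 16), torch.randint(0, 2, (512,))
    data = TensorDataset(X, y)
    net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
    mask = instant_soup_mask(net, data, sparsity=0.5, n_weak=4)
    kept = sum(m.sum().item() for m in mask.values())
    total = sum(m.numel() for m in mask.values())
    print(f"final mask keeps {kept / total:.1%} of prunable weights")
```

In this toy setting the "weak" runs differ only in their random data subsets; in the paper's setting the protocols themselves (e.g. learning rates, iteration budgets) also vary, and the resulting mask is applied to a large pre-trained model such as CLIP or BERT before continued fine-tuning.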
Authors: Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Ying Ding, Zhangyang Wang