
Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks (2307.08013v2)

Published 16 Jul 2023 in cs.LG and cs.CV

Abstract: Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite-layer models with elegant solution-finding procedures and a constant memory footprint. However, despite several attempts, these methods are heavily constrained by model inefficiency and optimization instability. Furthermore, fair benchmarking across relevant methods for vision tasks is missing. In this work, we revisit this line of implicit models and trace them back to the original weight-tied models. Surprisingly, we observe that weight-tied models are more effective, stable, and efficient on vision tasks than the DEQ variants. Through the lens of these simple yet clean weight-tied models, we further study the fundamental limits in the model capacity of such models and propose the use of distinct sparse masks to improve model capacity. Finally, for practitioners, we offer design guidelines regarding the depth, width, and sparsity selection for weight-tied models, and demonstrate the generalizability of our insights to other learning paradigms.
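
The abstract contrasts DEQ-style fixed-point solving with plain weight-tied models (one set of weights unrolled for a fixed number of steps) and proposes distinct sparse masks over the shared weights to recover model capacity. The PyTorch sketch below illustrates only these two ideas under stated assumptions; the module name, iteration count, random mask scheme, and per-step input injection are illustrative choices, not the paper's implementation.

# Minimal sketch (not the authors' code): a weight-tied block that reuses one
# convolution for n_iters unrolled steps, with a distinct fixed binary sparse
# mask applied to the shared weights at each step. All names and defaults
# (WeightTiedBlock, n_iters, sparsity) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightTiedBlock(nn.Module):
    def __init__(self, channels: int, n_iters: int = 8, sparsity: float = 0.5):
        super().__init__()
        # One shared convolution reused at every iteration (weight tying).
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(1, channels)
        self.n_iters = n_iters
        # One fixed random binary mask per iteration over the shared weight,
        # giving each unrolled step a different sparse view of the same parameters.
        masks = (torch.rand(n_iters, *self.conv.weight.shape) > sparsity).float()
        self.register_buffer("masks", masks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for t in range(self.n_iters):
            w = self.conv.weight * self.masks[t]  # masked shared weights
            # Inject the input x at every step, as in weight-tied/DEQ-style models.
            z = F.relu(self.norm(F.conv2d(z, w, padding=1) + x))
        return z


if __name__ == "__main__":
    block = WeightTiedBlock(channels=32, n_iters=6, sparsity=0.5)
    out = block(torch.randn(2, 32, 16, 16))
    print(out.shape)  # torch.Size([2, 32, 16, 16])

Registering the masks as buffers keeps them fixed (not trained) while letting them move with the module across devices; a trained or structured mask is an equally plausible variant of the idea.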

