Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design (2405.01775v2)

Published 2 May 2024 in cs.AR and cs.LG

Abstract: The development of model compression is continuously motivated by the evolution of neural network accelerators built on ASICs or FPGAs. On the algorithm side, the ultimate goal of quantization or pruning is to accelerate expensive DNN computations on low-power hardware. However, this "design-and-deploy" workflow faces under-explored challenges in the current hardware-algorithm co-design community. First, although state-of-the-art quantization algorithms can reach low precision with negligible accuracy degradation, mainstream deep learning frameworks (e.g., PyTorch) support only non-customizable 8-bit precision, data formats, and parameter extraction. Second, although the objective of quantization is to enable computation with low-precision data, current SoTA algorithms treat the quantized integers as an intermediate result: the final output of the quantizer is a set of "discretized" floating-point values, which ignores practical deployment needs and leaves integer parameter extraction and layer fusion as additional work for hardware designers. Finally, industry compression toolkits are constrained to in-house products or a handful of algorithms. The limited degree of freedom and under-explored customization of current toolkits hinder prototype ASIC- or FPGA-based accelerator design. To resolve these challenges, we propose Torch2Chip, an open-source, fully customizable, and high-performance toolkit that supports user-designed compression followed by automatic model fusion and parameter extraction. Torch2Chip adopts a hierarchical design workflow, and the user-customized compression algorithm is packed directly into a deployment-ready format for prototype chip verification with either CNNs or vision transformers (ViT). The code is available at https://github.com/SeoLabCornell/torch2chip.
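
To make the "discretized floating-point" problem described in the abstract concrete, the minimal sketch below (not Torch2Chip's actual API; the helper names and the symmetric uniform quantizer are illustrative assumptions) contrasts the typical fake-quantization output, which remains a floating-point tensor, with the integer weights and scaling factor a prototype accelerator actually needs.

```python
# Minimal sketch: fake quantization vs. integer parameter extraction.
# Assumes symmetric uniform quantization; names are hypothetical, not Torch2Chip's API.
import torch


def fake_quantize(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Typical SoTA quantizer output: 'discretized' floating-point values."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int * scale  # still a float tensor; the integer tensor is discarded


def extract_int_params(w: torch.Tensor, n_bits: int = 8):
    """What a prototype accelerator needs: the integer tensor plus its scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return w_int, scale  # deployment-ready integer weights and scaling factor


if __name__ == "__main__":
    w = torch.randn(64, 64)
    w_fake = fake_quantize(w, n_bits=4)            # dtype: float32
    w_int, s = extract_int_params(w, n_bits=4)     # dtype: int8, plus scale
    print(w_fake.dtype, w_int.dtype, s.item())
```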
