
LookupFFN: Making Transformers Compute-lite for CPU inference (2403.07221v1)

Published 12 Mar 2024 in cs.LG

Abstract: While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons including ease of workflow, security, and cost have led to efforts investigating whether CPUs may be viable for inference in routine use across many sectors of industry. But the imbalance between the compute capabilities of GPUs and CPUs is huge. Motivated by these considerations, we study a module that is a workhorse within modern DNN architectures, GEMM-based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite. Specifically, we propose an alternative formulation (we call it LookupFFN) to GEMM-based FFNs, inspired by recent studies using Locality Sensitive Hashing (LSH) to approximate FFNs. Our formulation recasts most essential operations as memory look-ups, leveraging the trade-off between the two resources available on any platform: compute and memory (CPUs offer the latter in abundance). For RoBERTa language model pretraining, our formulation achieves performance similar to GEMM-based FFNs while dramatically reducing the required FLOPs. Our development is complemented by a detailed hardware profiling of strategies that will maximize efficiency -- not just on contemporary hardware but on products that will be offered in the near/medium-term future. Code is available at https://github.com/mlpen/LookupFFN.
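
The abstract describes replacing the dense GEMM of a standard FFN with LSH-driven memory look-ups. The sketch below is a rough illustration of that general idea only, not the paper's implementation (which is in the linked repository): a SimHash-style hash selects a bucket in each of several tables, and the layer output is the sum of the stored bucket vectors, so the bulk of the work becomes table reads rather than a large matrix multiply. All names and hyperparameters here (LookupFFNSketch, n_tables, n_bits) are hypothetical.

```python
# Minimal illustrative sketch of an LSH-based "lookup FFN" (NOT the authors' code).
import numpy as np


class LookupFFNSketch:
    def __init__(self, d_model, n_tables=8, n_bits=10, seed=0):
        rng = np.random.default_rng(seed)
        # Random hyperplanes for a SimHash-style LSH: the sign pattern of
        # x @ planes picks a bucket in each table.
        self.planes = rng.standard_normal((n_tables, d_model, n_bits))
        # One stored output vector per bucket per table (random placeholders
        # here; in a trained model these entries would be learned).
        self.tables = 0.02 * rng.standard_normal((n_tables, 2 ** n_bits, d_model))
        self.bit_weights = 2 ** np.arange(n_bits)  # bit pattern -> integer bucket id

    def forward(self, x):
        # x: (batch, d_model). The only dense projections are the small hash
        # projections; the output is gathered from memory, not produced by a
        # large GEMM.
        out = np.zeros_like(x)
        for t in range(self.planes.shape[0]):
            bits = (x @ self.planes[t] > 0).astype(np.int64)  # (batch, n_bits)
            idx = bits @ self.bit_weights                      # (batch,) bucket ids
            out += self.tables[t, idx]                         # memory lookup
        return out


if __name__ == "__main__":
    x = np.random.default_rng(1).standard_normal((4, 64))
    y = LookupFFNSketch(d_model=64).forward(x)
    print(y.shape)  # (4, 64)
```

The compute cost here is dominated by the small hash projections (d_model x n_bits per table) instead of the usual d_model x 4*d_model GEMMs, which is the compute/memory trade-off the abstract alludes to.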
