FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers (2407.21418v1)

Published 31 Jul 2024 in cs.LG and cs.DC

Abstract: Many artificial intelligence models process inputs of varying lengths and resolutions, making the shapes of their tensors dynamic. Because the performance of these models depends on tensor shape, it is difficult to optimize the tensors before the model runs. There are two common solutions. The first pads the input with useless data to match a pre-optimized tensor library. The second composes small basic tensors into a tensor closest in size to the input and then tunes it to minimize padding; however, this tuning can be time-consuming. This paper proposes FTuner, a new technique for deep learning compilers. Instead of searching a large design space or training a cost model, FTuner uses an abstract computational unit called the uKernel to patch together small, variously sized tensors to match the shape of the input tensor, and determines the uKernel's shape with an analytic hardware information model. Experiments show that FTuner achieves operator and end-to-end performance comparable to vendor libraries, and a 3% speedup over an existing auto-tuner that uses a model-training compiler, while reducing tuning time by two orders of magnitude.
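
To make the uKernel idea concrete, here is a minimal sketch of covering a dynamic dimension by patching together micro-kernels of a few fixed sizes so that leftover padding stays small. The candidate sizes and the greedy strategy are assumptions for illustration only; FTuner derives its uKernel shapes from an analytic hardware information model, not from this heuristic.

```python
# Illustrative sketch of uKernel-style tiling for a dynamic dimension.
# The candidate widths and greedy covering below are hypothetical;
# FTuner chooses uKernel shapes via an analytic hardware model.

UKERNEL_SIZES = [128, 64, 32, 16]  # assumed candidate widths, descending

def cover(dim: int, sizes=UKERNEL_SIZES) -> list[int]:
    """Greedily tile `dim` with the given uKernel sizes, padding any
    remainder up to the smallest size."""
    tiles, remaining = [], dim
    for s in sizes:
        while remaining >= s:
            tiles.append(s)
            remaining -= s
    if remaining:                      # pad the tail to the smallest uKernel
        tiles.append(sizes[-1])
    return tiles

if __name__ == "__main__":
    for seq_len in (100, 384, 517):   # dynamic input lengths
        tiles = cover(seq_len)
        pad = sum(tiles) - seq_len
        print(f"len={seq_len}: tiles={tiles}, padding={pad}")
```

For a length of 384 this yields three 128-wide uKernels with zero padding, while a length of 517 is covered by four 128-wide tiles plus one 16-wide tile with 11 elements of padding, far less than padding the whole input to the next pre-optimized size.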

