
Analyzing the Performance Portability of Tensor Decomposition (2307.03276v1)

Published 6 Jul 2023 in cs.DC

Abstract: We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, $\Phi^{(n)}$, is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve 2.25x average speedup over the SparTen default for $\Phi^{(n)}$ computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.
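The roofline model the authors use to bound performance says that attainable throughput is the lesser of the machine's peak compute rate and the rate the memory system can feed the kernel, given its arithmetic intensity (FLOPs per byte moved). A minimal sketch of that bound, with the peak numbers below chosen purely for illustration (they are not taken from the paper):

```python
def roofline_bound(peak_gflops, peak_gbps, arithmetic_intensity):
    """Attainable performance (GFLOP/s) under the roofline model.

    peak_gflops: machine peak compute rate (GFLOP/s)
    peak_gbps: peak memory bandwidth (GB/s)
    arithmetic_intensity: kernel FLOPs per byte of memory traffic
    """
    return min(peak_gflops, peak_gbps * arithmetic_intensity)


# Illustrative numbers only: a kernel with low arithmetic intensity
# is memory-bound; a compute-heavy kernel hits the compute ceiling.
print(roofline_bound(100.0, 50.0, 0.5))  # memory-bound: 25.0 GFLOP/s
print(roofline_bound(100.0, 50.0, 4.0))  # compute-bound: 100.0 GFLOP/s
```

Sparse tensor kernels such as the $\Phi^{(n)}$ computation typically sit on the memory-bound side of this model, which is consistent with the paper's finding that cache reuse, not atomics, is the lever for improvement.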
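The MTTKRP benchmark mentioned above computes, for a chosen mode $n$, a matrix whose row $i_n$ accumulates each nonzero tensor entry times the elementwise product of the other factor-matrix rows it indexes. A hedged reference sketch in NumPy for a COO-format sparse tensor (this is an illustrative Python equivalent, not the Kokkos or PASTA implementation the paper benchmarks):

```python
import numpy as np

def sparse_mttkrp(coords, vals, factors, mode):
    """Mode-`mode` MTTKRP for a sparse tensor in COO form.

    coords: (nnz, N) integer array of indices for each nonzero
    vals: (nnz,) array of nonzero values
    factors: list of N factor matrices, factors[m] has shape (I_m, R)
    Returns an (I_mode, R) matrix M with
        M[i_mode, r] = sum over nonzeros x * prod_{m != mode} U_m[i_m, r]
    """
    rank = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], rank))
    for idx, v in zip(coords, vals):
        # Start each accumulated row from the nonzero's value,
        # then multiply in the matching rows of the other factors.
        row = v * np.ones(rank)
        for m, U in enumerate(factors):
            if m != mode:
                row *= U[idx[m]]
        out[idx[mode]] += row
    return out
```

The per-nonzero scatter into `out[idx[mode]]` is where a parallel implementation needs atomics or privatized accumulators, and the irregular reads of `U[idx[m]]` are what make cache reuse the dominant performance concern.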

