Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation (2403.06348v1)

Published 11 Mar 2024 in cs.DC, cs.DS, and cs.PF

Abstract: High-dimensional sparse data emerge in many critical application domains such as cybersecurity, healthcare, anomaly detection, and trend analysis. To quickly extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes, data distributions, and sparsity, which pose significant challenges for making efficient use of modern parallel architectures. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures (i.e., tensor slices or blocks) or along a particular dimension/mode (i.e., mode-specific) is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (ALTO), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. To demonstrate the efficacy of ALTO, we accelerate popular TD methods that compute the Canonical Polyadic Decomposition (CPD) model across a range of real-world sparse tensors. Additionally, we characterize the major execution bottlenecks of TD methods on multiple generations of the latest Intel Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets, ALTO outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, which require multiple tensor copies, ALTO achieves more than 5.1x geometric mean speedup at a fraction (25%) of their storage.
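
To make the linearization idea in the abstract concrete, the sketch below packs each nonzero's multi-mode coordinates into a single integer index by interleaving bits across modes, in the spirit of a Morton/space-filling-curve ordering. This is only a minimal, illustrative Python sketch: the function name, the uniform round-robin bit layout, and the toy data are assumptions, since ALTO adapts its bit arrangement to each mode's extent and the tensor's sparsity rather than interleaving uniformly.

```python
import numpy as np

def linearize_coords(coords, dims):
    """Pack the multi-mode indices of sparse nonzeros into one linear index
    by interleaving the bits of each mode's coordinate.

    Illustrative, Morton-order-style sketch only; ALTO itself adapts the bit
    arrangement to the (possibly very different) extents of each mode.
    """
    n_modes = len(dims)
    n_bits = [max(1, int(np.ceil(np.log2(d)))) for d in dims]
    linear = np.zeros(coords.shape[0], dtype=np.uint64)
    out_bit = 0
    for b in range(max(n_bits)):          # bit position within a mode
        for m in range(n_modes):          # round-robin across modes
            if b < n_bits[m]:
                bit = (coords[:, m].astype(np.uint64) >> np.uint64(b)) & np.uint64(1)
                linear |= bit << np.uint64(out_bit)
                out_bit += 1
    return linear

# Example: a 3-mode tensor with three nonzeros.
coords = np.array([[0, 5, 2],
                   [7, 1, 3],
                   [4, 4, 0]])
dims = (8, 8, 4)
idx = linearize_coords(coords, dims)
order = np.argsort(idx)                   # sort nonzeros by linearized index
print(coords[order], idx[order])
```

Sorting the nonzeros by such a linearized index yields a single, mode-agnostic traversal order, which is the property the paper exploits to stream the tensor from memory and to parallelize the MTTKRP and CPD kernels over contiguous index ranges.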

