Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network (2307.05740v2)
Abstract: Sparse tensor decomposition and completion are common in numerous applications, ranging from machine learning to computational quantum chemistry. Typically, the main bottleneck in optimizing these models is the contraction of a single large sparse tensor with a network of several dense matrices or tensors (SpTTN). Prior works on high-performance tensor decomposition and completion have focused on performance and scalability optimizations for specific SpTTN kernels. We present algorithms and a runtime system for identifying and executing the most efficient loop nest for any SpTTN kernel. We consider both enumeration of such loop nests for autotuning and efficient algorithms for finding the lowest-cost loop nest under simpler metrics, such as buffer size or cache-miss models. Our runtime system identifies the best choice of loop nest without user guidance, and also provides a distributed-memory parallelization of SpTTN kernels. We evaluate our framework using both real-world and synthetic tensors. Our results demonstrate that our approach outperforms state-of-the-art general-purpose libraries and matches the performance of specialized codes.
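To make the kernel class concrete (this is an illustrative sketch, not code from the paper), the example below shows one common SpTTN kernel, the matricized-tensor-times-Khatri-Rao product (MTTKRP) that arises in CP decomposition, written as a single fused loop nest over the nonzeros of a COO-format sparse tensor. The names `mttkrp_coo`, `inds`, `vals`, the factor matrices `B` and `C`, and the rank `R` are assumed for illustration; the paper's contribution is choosing among the many valid orderings and fusions of such loops, of which this is only one.

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, I, R):
    """MTTKRP for a third-order sparse tensor T in COO format:
    A[i, r] = sum_{j, k} T[i, j, k] * B[j, r] * C[k, r].
    `inds` is an (nnz, 3) array of coordinates, `vals` holds the nonzero values."""
    A = np.zeros((I, R))
    for (i, j, k), t in zip(inds, vals):
        # One possible fused loop nest: outer loop over nonzeros of the sparse
        # tensor, inner (vectorized) loop over the rank index r.
        A[i, :] += t * (B[j, :] * C[k, :])
    return A

# Small usage example with a random sparse tensor.
rng = np.random.default_rng(0)
I = J = K = 4
R = 2
nnz = 5
inds = rng.integers(0, 4, size=(nnz, 3))
vals = rng.random(nnz)
B = rng.random((J, R))
C = rng.random((K, R))
A = mttkrp_coo(inds, vals, B, C, I, R)
```

Other loop nests for the same contraction (e.g., hoisting the Hadamard product of B and C rows, or tiling the rank dimension) trade buffer size against recomputation, which is exactly the space of choices the proposed runtime system searches over.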