KAPLA: Pragmatic Representation and Fast Solving of Scalable NN Accelerator Dataflow (2306.15676v1)
Abstract: Dataflow scheduling decisions are of vital importance to neural network (NN) accelerators. Recent scalable NN accelerators support a rich set of advanced dataflow techniques, so comprehensively representing and quickly finding optimized dataflow schemes becomes significantly more complicated and challenging. In this work, we first propose comprehensive and pragmatic dataflow representations for temporal and spatial scheduling on scalable multi-node NN architectures. An informal hierarchical taxonomy highlights the tight coupling across different levels of the dataflow space as the major difficulty for fast design exploration. A set of formal tensor-centric directives accurately expresses various inter-layer and intra-layer schemes, and allows their validity and efficiency to be determined quickly. We then build KAPLA, a generic, optimized, and fast dataflow solver that uses the pragmatic directives to explore the design space with effective validity checks and efficiency estimation. KAPLA decouples the upper inter-layer level for fast pruning, and solves the lower intra-layer schemes with a novel bottom-up cost-descending method. The dataflows KAPLA finds incur only 2.2% and 7.7% energy overhead for training and inference, respectively, compared to the exhaustively searched optimal schemes. KAPLA also outperforms random and machine-learning-based approaches, producing better schemes with orders-of-magnitude faster search.
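To make the bottom-up cost-descending idea from the abstract more concrete, below is a minimal Python sketch under simplified assumptions. It greedily grows per-dimension blocking factors for a single hypothetical layer as long as a toy buffer-fit validity check passes and a toy traffic-based cost estimate keeps descending. The layer shape, buffer capacity, footprint and cost models, and all function names are illustrative assumptions, not KAPLA's actual directives, cost model, or code.

```python
# A minimal, illustrative sketch of a bottom-up cost-descending search over
# intra-layer blocking (tiling) factors. It is NOT KAPLA's implementation:
# the layer shape, buffer size, footprint model, and cost model below are
# simplified assumptions chosen only to make the search structure concrete.

LAYER = {"K": 256, "C": 128, "P": 1024}   # hypothetical loop bounds
BUFFER_WORDS = 64 * 1024                  # hypothetical on-chip buffer size


def divisors(n):
    """All divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def footprint(tile):
    """Toy on-chip footprint: weight, input, and output tiles."""
    return (tile["K"] * tile["C"]
            + tile["C"] * tile["P"]
            + tile["K"] * tile["P"])


def is_valid(tile):
    """Validity check: the tile must fit within the buffer."""
    return footprint(tile) <= BUFFER_WORDS


def estimate_cost(tile):
    """Toy efficiency estimate: total off-chip traffic over all tiles."""
    num_tiles = 1
    for dim, bound in LAYER.items():
        num_tiles *= -(-bound // tile[dim])  # ceiling division
    return num_tiles * footprint(tile)


def bottom_up_search():
    """Start from the smallest valid tile and greedily grow one dimension
    at a time while the estimated cost keeps descending."""
    tile = {dim: 1 for dim in LAYER}
    best_cost = estimate_cost(tile)
    improved = True
    while improved:
        improved = False
        for dim, bound in LAYER.items():
            for d in divisors(bound):
                if d <= tile[dim]:
                    continue
                cand = dict(tile, **{dim: d})
                cost = estimate_cost(cand)
                if is_valid(cand) and cost < best_cost:
                    tile, best_cost = cand, cost
                    improved = True
                    break
    return tile, best_cost


if __name__ == "__main__":
    tile, cost = bottom_up_search()
    print("chosen blocking factors:", tile, "estimated cost:", cost)
```

A real solver along these lines would search over the full directive space (loop ordering, spatial partitioning across nodes, and inter-layer pipelining), not blocking factors alone; the sketch only illustrates the cost-descending loop structure.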
- Zhiyao Li
- Mingyu Gao