Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching (2401.06362v3)
Abstract: Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of the above approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates inference by 170x relative to the large model and by 9.4x relative to the distilled model. DART matches the latency and storage costs of the state-of-the-art rule-based prefetcher BO while surpassing it by 6.1% in IPC improvement. It also outperforms the state-of-the-art NN-based prefetchers TransFetch and Voyager in IPC improvement by 33.1% and 37.2%, respectively, primarily due to its low prefetching latency.
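For intuition about what converting matrix multiplications into fast table lookups can look like, below is a minimal, hypothetical Python sketch of product-quantization-style approximate matrix multiplication in the spirit of Jegou et al. and Blalock and Guttag (both cited below). It is not DART's actual table hierarchy: the function names, the single-level lookup structure, and the plain k-means encoder are illustrative assumptions.

```python
import numpy as np

def build_tables(X_train, W, n_subspaces=4, n_prototypes=16, n_iters=20):
    """Offline: learn K prototypes per input subspace and precompute their
    partial products with the weight matrix W (one lookup table per subspace)."""
    d_in, d_out = W.shape
    sub = d_in // n_subspaces
    prototypes, tables = [], []
    rng = np.random.default_rng(0)
    for s in range(n_subspaces):
        Xs = X_train[:, s * sub:(s + 1) * sub]
        # toy k-means on this subspace
        centers = Xs[rng.choice(len(Xs), n_prototypes, replace=False)].copy()
        for _ in range(n_iters):
            dists = ((Xs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for k in range(n_prototypes):
                members = Xs[assign == k]
                if len(members):
                    centers[k] = members.mean(axis=0)
        prototypes.append(centers)
        # row k of the table = contribution of prototype k to every output column
        tables.append(centers @ W[s * sub:(s + 1) * sub, :])  # shape (K, d_out)
    return prototypes, tables

def lookup_matmul(x, prototypes, tables):
    """Online: encode each subspace of x as its nearest prototype index,
    then replace the dot product with table lookups and additions."""
    sub = len(x) // len(prototypes)
    y = np.zeros(tables[0].shape[1])
    for centers, table in zip(prototypes, tables):
        s = prototypes.index(centers) if False else None  # (unused; kept simple below)
    for s, (centers, table) in enumerate(zip(prototypes, tables)):
        xs = x[s * sub:(s + 1) * sub]
        code = ((centers - xs) ** 2).sum(axis=1).argmin()  # nearest-prototype code
        y += table[code]                                    # one lookup + add per subspace
    return y

# Toy usage: the lookup result roughly tracks the exact product x @ W.
rng = np.random.default_rng(1)
X_train = rng.standard_normal((2000, 32))
W = rng.standard_normal((32, 8))
protos, tabs = build_tables(X_train, W)
x = rng.standard_normal(32)
print(np.abs(lookup_matmul(x, protos, tabs) - x @ W).max())  # approximation error
```

At inference time each subspace contributes a single precomputed table row, so the per-output cost is a few index computations and additions rather than a full dot product; this substitution is the source of the arithmetic-operation and latency reductions the abstract quantifies.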
- S. P. Vander Wiel and D. J. Lilja, “When caches aren’t enough: Data prefetching techniques,” Computer, vol. 30, no. 7, pp. 23–30, 1997.
- C. Carvalho, “The gap between processor and memory speeds,” in Proc. of IEEE International Conference on Control and Automation, 2002.
- S. Kumar and C. Wilkerson, “Exploiting spatial locality in data caches using spatial footprints,” in Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No. 98CB36235). IEEE, 1998, pp. 357–368.
- S. Mittal, “A survey of recent prefetching techniques for processor caches,” ACM Computing Surveys (CSUR), vol. 49, no. 2, pp. 1–35, 2016.
- P. Michaud, “Best-offset hardware prefetching,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2016, pp. 469–480.
- A. Jain and C. Lin, “Linearizing irregular memory accesses for improved correlated prefetching,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013, pp. 247–259.
- P. Zhang, R. Kannan, A. Srivastava, A. V. Nori, and V. K. Prasanna, “Resemble: Reinforced ensemble framework for data prefetching,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 282–292.
- P. Zhang, A. Srivastava, A. V. Nori, R. Kannan, and V. K. Prasanna, “Fine-grained address segmentation for attention-based variable-degree prefetching,” in Proceedings of the 19th ACM International Conference on Computing Frontiers, 2022, pp. 103–112.
- P. Zhang, R. Kannan, and V. K. Prasanna, “Phases, modalities, spatial and temporal locality: Domain specific ml prefetcher for accelerating graph analytics,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–15.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- A. Srivastava, A. Lazaris, B. Brooks, R. Kannan, and V. K. Prasanna, “Predicting memory accesses: the road to compact ml-driven prefetcher,” in Proceedings of the International Symposium on Memory Systems, 2019, pp. 461–470.
- A. Srivastava, T.-Y. Wang, P. Zhang, C. A. F. De Rose, R. Kannan, and V. K. Prasanna, “Memmap: Compact and generalizable meta-lstm models for memory access prediction,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2020, pp. 57–68.
- P. Zhang, A. Srivastava, B. Brooks, R. Kannan, and V. K. Prasanna, “Raop: Recurrent neural network augmented offset prefetcher,” in The International Symposium on Memory Systems, 2020, pp. 352–362.
- P. Zhang, A. Srivastava, T.-Y. Wang, C. A. De Rose, R. Kannan, and V. K. Prasanna, “C-memmap: clustering-driven compact, adaptable, and generalizable meta-lstm models for memory access prediction,” International Journal of Data Science and Analytics, pp. 1–14, 2021.
- Z. Shi, A. Jain, K. Swersky, M. Hashemi, P. Ranganathan, and C. Lin, “A hierarchical neural model of data prefetching,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021, pp. 861–873.
- M. Hashemi, K. Swersky, J. A. Smith, G. Ayers, H. Litz, J. Chang, C. Kozyrakis, and P. Ranganathan, “Learning memory access patterns,” arXiv preprint arXiv:1803.02329, 2018.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
- P. Zhang, R. Kannan, X. Tong, A. V. Nori, and V. K. Prasanna, “Sharp: Software hint-assisted memory access prediction for graph analytics,” in 2022 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2022, pp. 1–8.
- H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–128, 2010.
- D. Blalock and J. Guttag, “Multiplying matrices without multiplying,” in International Conference on Machine Learning. PMLR, 2021, pp. 992–1004.
- M. Islam, S. Banerjee, M. Meswani, and K. Kavi, “Prefetching as a potentially effective technique for hybrid memory optimization,” in Proceedings of the Second International Symposium on Memory Systems, ser. MEMSYS ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 220–231. [Online]. Available: https://doi.org/10.1145/2989081.2989129
- H. Choi and S. Park, “A survey of machine learning-based system performance optimization techniques,” Applied Sciences, vol. 11, no. 7, 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/7/3235
- P. Zhang, R. Kannan, A. V. Nori, and V. K. Prasanna, “A2p: Attention-based memory access prediction for graph analytics,” 2022.
- C. Yang, X. Man, and J. Shao, “G&l: An attention-based model for improving prefetching in solid-state drives,” in 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 1–8.
- S. Rahman, M. Burtscher, Z. Zong, and A. Qasem, “Maximizing hardware prefetch effectiveness with machine learning,” in 2015 IEEE 17th International Conference on High Performance Computing and Communications, Aug. 2015, pp. 383–389.
- L. Peled, S. Mannor, U. Weiser, and Y. Etsion, “Semantic locality and context-based prefetching using reinforcement learning,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2015, pp. 285–297.
- F. Eris, M. S. Louis, K. Eris, J. L. Abellan, and A. Joshi, “Puppeteer: A random forest-based manager for hardware prefetchers across the memory hierarchy,” 2022.
- E. S. Alcorta, M. Madhav, S. Tetrick, N. J. Yadwadkar, and A. Gerstlauer, “Lightweight ml-based runtime prefetcher selection on many-core platforms,” 2023.
- S. Mohapatra and B. Panda, “Drishyam: An image is worth a data prefetcher,” in 32nd ACM International Conference on Parallel Architectures and Compilation Techniques (PACT ’23), ACM. ACM New York, NY, USA, October 2023. [Online]. Available: https://doi.org/10.1145/3528416.3530224
- T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, “Pruning and quantization for deep neural network acceleration: A survey,” Neurocomputing, vol. 461, pp. 370–403, 2021.
- Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” 2019.
- S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021.
- T. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6655–6659.
- G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.
- J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021.
- N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12.
- K. Abdelouahab, M. Pelcat, J. Serot, and F. Berry, “Accelerating cnn inference on fpgas: A survey,” 2018.
- S. Abi-Karam, Y. He, R. Sarkar, L. Sathidevi, Z. Qiao, and C. Hao, “Gengnn: A generic fpga framework for graph neural network acceleration,” 2022.
- D. P. Francis and K. Raimond, “A practical streaming approximate matrix multiplication algorithm,” J. King Saud Univ. Comput. Inf. Sci., vol. 34, no. 1, p. 1455–1465, jan 2022. [Online]. Available: https://doi.org/10.1016/j.jksuci.2018.09.010
- H. You, X. Chen, Y. Zhang, C. Li, S. Li, Z. Liu, Z. Wang, and Y. Lin, “Shiftaddnet: A hardware-inspired deep network,” 2020.
- M. Elhoushi, Z. Chen, F. Shafiq, Y. H. Tian, and J. Y. Li, “Deepshift: Towards multiplication-less neural networks,” 2021.
- J. M. Joyce, “Kullback-leibler divergence,” in International encyclopedia of statistical science. Springer, 2011, pp. 720–722.
- P. K. Meher, “An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks,” in 2010 18th IEEE/IFIP International Conference on VLSI and System-on-Chip. IEEE, 2010, pp. 91–95.
- ChampSim, “ChampSim simulator,” https://github.com/champsim/champsim/, 2017.
- A. Jaleel, “Memory characterization of workloads using instrumentation-driven simulation,” Web Copy: http://www.glue.umd.edu/ajaleel/workload, 2010.
- SPEC CPU2017, “The Standard Performance Evaluation Corporation,” https://www.spec.org/cpu2017/, 2017.
- H. T. Kung and C. E. Leiserson, “Systolic arrays (for VLSI),” in Sparse Matrix Proceedings 1978, vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1979, pp. 256–282.
- C. Goutte and E. Gaussier, “A probabilistic interpretation of precision, recall and f-score, with implication for evaluation,” in European conference on information retrieval. Springer, 2005, pp. 345–359.
- V. Srinivasan, E. S. Davidson, and G. S. Tyson, “A prefetch taxonomy,” IEEE Transactions on Computers, vol. 53, no. 2, pp. 126–140, 2004.