PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System (2404.07164v2)
Abstract: Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance. Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern ML training workloads based on SGD are bottlenecked by data movement between the processor and memory units, due to the poor data locality in accessing large datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing computation mechanisms inside or near memory. Our goal is to understand the capabilities of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms on the real-world UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare them to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and highlight the need for a shift toward algorithm-hardware co-design. Our results demonstrate three major findings: 1) the UPMEM PIM system can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, especially when operations and datatypes are natively supported by PIM hardware; 2) it is important to carefully choose the optimization algorithm that best fits PIM; and 3) the UPMEM PIM system does not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. We open-source all our code to facilitate future research.
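To make the abstract's notion of "centralized parallel SGD" concrete, the sketch below shows the general scheme such algorithms follow: workers (e.g., PIM nodes) each run local minibatch SGD on their shard of the data, and a central host periodically averages the worker models. This is a minimal illustrative sketch, not the paper's implementation; all names and parameters (`n_workers`, `local_steps`, the logistic-regression objective) are assumptions chosen for clarity.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the logistic-regression loss on a minibatch."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def parallel_sgd(X, y, n_workers=8, local_steps=16, rounds=100, lr=0.1):
    """Centralized parallel SGD with periodic model averaging (illustrative)."""
    d = X.shape[1]
    w_global = np.zeros(d)
    # Static partition of the dataset across workers (one shard per node).
    shards = np.array_split(np.arange(len(y)), n_workers)
    for _ in range(rounds):
        local_models = []
        for idx in shards:                    # each worker runs independently
            w = w_global.copy()
            for _ in range(local_steps):      # local minibatch SGD steps
                batch = np.random.choice(idx, size=32)
                w -= lr * logistic_grad(w, X[batch], y[batch])
            local_models.append(w)
        # Centralized step: the host gathers and averages the worker models.
        w_global = np.mean(local_models, axis=0)
    return w_global
```

On a real PIM system, the communication cost of the gather-and-average step (host reading model copies from every PIM node) is one of the factors that determines which variant of this scheme fits the hardware best.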