
SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory (2310.01893v1)

Published 3 Oct 2023 in cs.AR, cs.DC, and cs.SE

Abstract: Data movement between memory and processors is a major bottleneck in modern computing systems. The processing-in-memory (PIM) paradigm aims to alleviate this bottleneck by performing computation inside memory chips. Real PIM hardware (e.g., the UPMEM system) is now available and has demonstrated potential in many applications. However, programming such real PIM hardware remains a challenge for many programmers. This paper presents a new software framework, SimplePIM, to aid programming real PIM systems. The framework processes arrays of arbitrary elements on a PIM device by calling iterator functions from the host and provides primitives for communication among PIM cores and between PIM and the host system. We implement SimplePIM for the UPMEM PIM system and evaluate it on six major applications. Our results show that SimplePIM enables 66.5% to 83.1% reduction in lines of code in PIM programs. The resulting code leads to higher performance (between 10% and 37% speedup) than hand-optimized code in three applications and provides comparable performance in three others. SimplePIM is fully and freely available at https://github.com/CMU-SAFARI/SimplePIM.

Authors (5)
  1. Jinfan Chen (3 papers)
  2. Juan Gómez-Luna (57 papers)
  3. Izzat El Hajj (17 papers)
  4. Yuxin Guo (21 papers)
  5. Onur Mutlu (279 papers)
Citations (9)
