Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders (2404.19381v3)
Abstract: Emerging Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. While its CXL.mem protocol keeps latency overhead minimal through an optimized protocol stack, frequent CXL memory accesses can still cause significant slowdowns for memory-bound applications, whether latency-sensitive or bandwidth-intensive. Near-data processing (NDP) in the CXL controller promises to overcome these limitations of passive CXL memory. However, prior work on NDP in CXL memory proposes application-specific units that are unsuitable for practical CXL memory-based systems, which must support diverse applications. On the other hand, existing CPU or GPU cores are not cost-effective for NDP because they are not optimized for memory-bound workloads. In addition, communication between the host processor and the CXL controller for NDP offloading must be low-latency, but existing CXL.io/PCIe-based mechanisms incur µs-scale latency and are unsuitable for fine-grained NDP. To achieve high-performance NDP end to end, we propose a low-overhead, general-purpose NDP architecture for CXL memory called Memory-Mapped NDP (M²NDP), which comprises memory-mapped functions (M²func) and memory-mapped µthreading (M²µthread). M²func is a CXL.mem-compatible, low-overhead communication mechanism between the host processor and the NDP controller in CXL memory. M²µthread enables a low-cost, general-purpose NDP unit design by introducing lightweight µthreads that support highly concurrent kernel execution with minimal resource waste. Combining the two, M²NDP achieves speedups of up to 128x (14.5x overall) and reduces energy by up to 87.9% (80.3% overall) for various workloads compared to baseline CPU/GPU hosts with passive CXL memory.
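The abstract's key offloading idea — triggering NDP through ordinary memory accesses to the expander instead of CXL.io/PCIe doorbells, and polling for completion rather than taking µs-scale interrupt paths — can be illustrated with a minimal host-side sketch. This is not the paper's actual M²func interface: the descriptor layout, `ndp_offload`, and the `controller` thread (standing in for the NDP controller inside the CXL device) are hypothetical names invented for illustration, and an ordinary shared struct stands in for a memory-mapped region in the expander's address space.

```c
/* Conceptual sketch of memory-mapped NDP offloading (assumptions noted above). */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical offload descriptor, written with ordinary stores. */
typedef struct {
    uint64_t kernel_id;        /* which pre-registered NDP kernel to run      */
    uint64_t arg_base;         /* device-side address of the input data       */
    uint64_t arg_count;        /* number of elements to process               */
    volatile uint64_t status;  /* 0 = idle, 1 = launched, 2 = done            */
} ndp_desc_t;

static ndp_desc_t slot;        /* stand-in for a memory-mapped descriptor slot */

/* Device side (stand-in): the NDP controller watches the descriptor address;
 * on the triggering store it would decode the call and run microthreads. */
static void *controller(void *arg)
{
    (void)arg;
    while (slot.status != 1)
        ;                      /* wait for the host's triggering store */
    /* ... run the requested kernel near the data ... */
    slot.status = 2;           /* completion becomes visible to a host load */
    return NULL;
}

/* Host side: an offload is just stores to the mapped region plus a poll. */
static void ndp_offload(uint64_t kernel_id, uint64_t base, uint64_t n)
{
    slot.kernel_id = kernel_id;
    slot.arg_base  = base;
    slot.arg_count = n;
    __sync_synchronize();      /* make arguments visible before the trigger */
    slot.status = 1;           /* the store that launches the NDP kernel */
    while (slot.status != 2)
        ;                      /* poll instead of taking an interrupt */
}

int main(void)
{
    pthread_t dev;
    pthread_create(&dev, NULL, controller, NULL);
    ndp_offload(/*kernel_id=*/42, /*base=*/0x1000, /*n=*/1024);
    pthread_join(dev, NULL);
    puts("NDP kernel completed");
    return 0;
}
```

Polling on a status word in the sketch mirrors the abstract's point that fine-grained offloads cannot afford interrupt-driven, µs-scale completion paths; the real mechanism's descriptor format and completion semantics are defined by the paper, not by this example.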
Authors: Hyungkyu Ham, Jeongmin Hong, Geonwoo Park, Yunseon Shin, Okkyun Woo, Wonhyuk Yang, Jinhoon Bae, Eunhyeok Park, Hyojin Sung, Euicheol Lim, Gwangsun Kim