UDON: A case for offloading to general purpose compute on CXL memory (2404.02868v1)
Abstract: Upcoming CXL-based disaggregated memory devices feature special-purpose units for near-memory compute offload. In this paper, we explore opportunities for offloading compute to general-purpose cores on CXL memory devices, thereby enabling greater utility and diversity of offload. We study two classes of popular memory-intensive applications, ML inference and vector databases, as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices. Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% of data in remote memory with just a 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide up to a 6.87$\times$ performance improvement with under 10% offload overhead.
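Because CXL type-2 hardware is not yet widely available, the dual-socket NUMA emulation described in the abstract can be approximated on a stock Linux box: bind the workload's memory to the remote socket (standing in for CXL-attached memory) while computing on the local socket. The sketch below is a minimal illustration of the HNSW search kernel that the paper considers for offload; it assumes `numactl` and the `faiss-cpu` package are installed, and the NUMA node IDs, dataset sizes, and HNSW parameters are hypothetical rather than taken from the paper.

```python
# Minimal sketch: emulate "HNSW kernel data on CXL memory" by binding this
# process's allocations to a remote NUMA node while computing on local cores.
# Hypothetical launch command (node IDs depend on the machine):
#   numactl --cpunodebind=0 --membind=1 python hnsw_offload_sketch.py
import numpy as np
import faiss  # pip install faiss-cpu

d, nb, nq, k = 128, 100_000, 1_000, 10        # illustrative sizes, not from the paper
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)    # base vectors (land in the bound "CXL" node)
xq = rng.random((nq, d), dtype=np.float32)    # query vectors

index = faiss.IndexHNSWFlat(d, 32)            # M = 32 graph neighbors (assumed value)
index.hnsw.efConstruction = 200               # build-time search width (assumed value)
index.add(xb)                                 # graph and vectors allocated in bound memory

index.hnsw.efSearch = 64                      # query-time search width (assumed value)
distances, ids = index.search(xq, k)          # the graph-traversal kernel considered for offload
print(ids[:2])
```

Note that binding memory with `--membind` only models the data-placement side of the experiment; the paper's offload to general-purpose cores on the memory device would correspond to additionally pinning the search kernel's threads to the remote socket's CPUs.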