Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models (2310.09949v3)
Abstract: A Retrieval-Augmented Language Model (RALM) augments a generative LM by retrieving context-specific knowledge from an external database. This strategy enables impressive text generation quality even with smaller models, reducing computational demands by orders of magnitude. However, RALMs introduce unique system design challenges due to (a) the divergent workload characteristics of LM inference and retrieval and (b) the varying system requirements and bottlenecks across RALM configurations, such as model size, database size, and retrieval frequency. We propose Chameleon, a heterogeneous accelerator system that integrates both LM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient acceleration of both LM inference and retrieval, while the accelerator disaggregation lets the system scale each type of accelerator independently to meet diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Compared to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x speedup and up to 26.2x better energy efficiency. Evaluated on various RALMs, Chameleon delivers up to 2.16x lower latency and 3.18x higher throughput than a hybrid CPU-GPU architecture. These promising results pave the way for bringing accelerator heterogeneity and disaggregation into future RALM systems.
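The retrieve-then-generate pattern the abstract describes can be sketched as follows. This is a minimal illustration, not Chameleon's implementation: the toy exact nearest-neighbor search stands in for the FPGA-accelerated ANN retrieval, and the placeholder `fake_lm` stands in for GPU-based LM inference; all function names here are assumptions made for the example.

```python
import numpy as np

def retrieve(query_vec, db_vecs, db_texts, k=2):
    """Toy exact nearest-neighbor retrieval.

    Real RALM systems use approximate indexes (e.g. IVF with product
    quantization) over billion-scale databases; Chameleon offloads this
    stage to FPGA-based retrieval accelerators.
    """
    # L2 distance from the query to every database vector.
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)
    top = np.argsort(dists)[:k]
    return [db_texts[i] for i in top]

def fake_lm(text):
    # Placeholder for LM inference (hypothetical); in Chameleon this
    # stage runs on GPUs, coordinated by a CPU server over the network.
    return f"[generated from {len(text)} chars of input]"

def ralm_generate(prompt, query_vec, db_vecs, db_texts, k=2):
    """One RALM step: retrieve context, then condition generation on it."""
    context = retrieve(query_vec, db_vecs, db_texts, k)
    augmented = "\n".join(context) + "\n" + prompt
    return fake_lm(augmented)
```

The two stages have very different workload characteristics (memory-bound vector scans versus compute-bound transformer inference), which is the motivation for accelerating them on separate, independently scalable hardware.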
Authors: Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, Gustavo Alonso