Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native (2401.12230v1)
Abstract: In this paper, we investigate the intersection of large generative AI models and cloud-native computing architectures. Recent large models such as ChatGPT, while revolutionary in their capabilities, face challenges like escalating costs and demand for high-end GPUs. Drawing analogies between large-model-as-a-service (LMaaS) and cloud database-as-a-service (DBaaS), we describe an AI-native computing paradigm that harnesses the power of both cloud-native technologies (e.g., multi-tenancy and serverless computing) and advanced machine learning runtimes (e.g., batched LoRA inference). These joint efforts aim to reduce the cost of goods sold (COGS) and improve resource accessibility. The journey of merging these two domains is just beginning, and we hope to stimulate future research and development in this area.
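To make the batched-LoRA idea concrete, here is a minimal PyTorch sketch. It is not Punica's actual fused CUDA kernels; the `batched_lora_linear` helper and all shapes and names are illustrative assumptions. It shows how requests from different tenants, each carrying its own low-rank adapter, can share one GEMM over the frozen base weight while the per-request low-rank updates are applied with batched matrix multiplies.

```python
# Minimal sketch of batched LoRA inference (illustrative, not a real serving kernel).
# Many tenants share one frozen base weight W; each request selects its own
# low-rank adapter pair (A_i, B_i) via an adapter index.
import torch

def batched_lora_linear(x, W, A, B, adapter_idx):
    """
    x:           (batch, d_in)            one token's activations per request
    W:           (d_out, d_in)            shared frozen base weight
    A:           (n_adapters, r, d_in)    stacked LoRA "down" matrices
    B:           (n_adapters, d_out, r)   stacked LoRA "up" matrices
    adapter_idx: (batch,)                 which adapter each request uses
    """
    base = x @ W.T                      # one shared GEMM amortized over the whole batch
    A_sel = A[adapter_idx]              # (batch, r, d_in): gather per-request adapters
    B_sel = B[adapter_idx]              # (batch, d_out, r)
    # Per-request low-rank update: delta_i = B_i @ (A_i @ x_i)
    delta = torch.bmm(B_sel, torch.bmm(A_sel, x.unsqueeze(-1))).squeeze(-1)
    return base + delta                 # base output plus per-tenant correction

if __name__ == "__main__":
    d_in, d_out, r, n_adapters, batch = 16, 32, 4, 8, 5
    x = torch.randn(batch, d_in)
    W = torch.randn(d_out, d_in)
    A = torch.randn(n_adapters, r, d_in)
    B = torch.randn(n_adapters, d_out, r)
    idx = torch.randint(0, n_adapters, (batch,))
    print(batched_lora_linear(x, W, A, B, idx).shape)  # torch.Size([5, 32])
```

The design point is that the expensive base-model computation is shared across all tenants in a batch, so serving many adapters costs little more than serving one; production systems such as Punica replace the gather-plus-`bmm` step above with fused kernels for efficiency.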