
Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native (2401.12230v1)

Published 17 Jan 2024 in cs.DC and cs.LG

Abstract: In this paper, we investigate the intersection of large generative AI models and cloud-native computing architectures. Recent large models such as ChatGPT, while revolutionary in their capabilities, face challenges like escalating costs and demand for high-end GPUs. Drawing analogies between large-model-as-a-service (LMaaS) and cloud database-as-a-service (DBaaS), we describe an AI-native computing paradigm that harnesses the power of both cloud-native technologies (e.g., multi-tenancy and serverless computing) and advanced machine learning runtimes (e.g., batched LoRA inference). These joint efforts aim to optimize the cost of goods sold (COGS) and improve resource accessibility. The journey of merging these two domains has just begun, and we hope to stimulate future research and development in this area.
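
The batched LoRA inference mentioned in the abstract is the key runtime technique for multi-tenant serving: many fine-tuned model variants share one set of base weights, and only small low-rank adapters differ per tenant. Below is a minimal PyTorch sketch of that idea; all shapes and names (batched_lora_forward, tenant_ids) and the naive gather-and-bmm formulation are illustrative assumptions, not the paper's implementation (production systems such as Punica rely on custom kernels for the adapter gather).

```python
import torch

# Hypothetical sizes: one shared base weight W, plus a small
# low-rank adapter (A_i, B_i) per tenant, with rank << d.
d_in, d_out, rank, n_tenants = 1024, 1024, 8, 4
W = torch.randn(d_in, d_out)                      # shared base weight
A = torch.randn(n_tenants, d_in, rank) * 0.01     # per-tenant LoRA A_i
B = torch.randn(n_tenants, rank, d_out) * 0.01    # per-tenant LoRA B_i

def batched_lora_forward(x: torch.Tensor, tenant_ids: torch.Tensor) -> torch.Tensor:
    """x: (batch, d_in); tenant_ids: (batch,) adapter index per request."""
    base = x @ W                                  # one base matmul over the whole batch
    # Gather each request's adapter and apply its low-rank correction:
    # y_i = x_i @ W + x_i @ A_{t(i)} @ B_{t(i)}
    xa = torch.bmm(x.unsqueeze(1), A[tenant_ids])       # (batch, 1, rank)
    delta = torch.bmm(xa, B[tenant_ids]).squeeze(1)     # (batch, d_out)
    return base + delta

# Requests from different tenants are served in one batch:
x = torch.randn(16, d_in)
tenant_ids = torch.randint(0, n_tenants, (16,))
y = batched_lora_forward(x, tenant_ids)           # (16, d_out)
```

Because rank << d, the per-tenant correction adds only O(batch x d x rank) extra work on top of the single shared base matmul, which is what makes serving many tenants from one GPU-resident base model economical.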
