Large Language Models (LLMs): Deployment, Tokenomics and Sustainability (2405.17147v1)

Published 27 May 2024 in cs.MM

Abstract: The rapid advancement of LLMs has significantly impacted human-computer interaction, epitomized by the release of GPT-4o, which introduced comprehensive multi-modality capabilities. In this paper, we first explore the deployment strategies, economic considerations, and sustainability challenges associated with state-of-the-art LLMs. More specifically, we discuss the deployment debate between Retrieval-Augmented Generation (RAG) and fine-tuning, highlighting their respective advantages and limitations. We then quantitatively analyze the xPU requirements for training and inference. Additionally, for the tokenomics of LLM services, we examine the balance between performance and cost from the perspective of end users' quality of experience (QoE). Lastly, we envision a future hybrid architecture for LLM processing and its corresponding sustainability concerns, particularly its environmental carbon footprint. Through these discussions, we provide a comprehensive overview of the operational and strategic considerations essential for the responsible development and deployment of LLMs.
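The xPU-requirement and tokenomics analysis the abstract refers to can be illustrated with a rough back-of-the-envelope estimate. The sketch below uses the widely cited ~6·N·D FLOPs rule of thumb for dense-model training compute together with a simple utilization assumption to size an accelerator fleet, and then converts hardware rental prices and serving throughput into a cost per generated token. The model size, token count, throughput, utilization, and price figures are illustrative assumptions for demonstration only, not numbers taken from the paper.

```python
# Back-of-envelope sketch of xPU sizing and cost per token.
# All concrete numbers below are illustrative assumptions, not values from the paper.

def training_gpu_count(params: float, tokens: float,
                       gpu_flops: float, utilization: float,
                       days: float) -> float:
    """Estimate accelerators needed to train a dense LLM within a time budget.

    Uses the common ~6 * N * D approximation for total training FLOPs.
    """
    total_flops = 6.0 * params * tokens
    flops_per_gpu = gpu_flops * utilization * days * 24 * 3600
    return total_flops / flops_per_gpu


def cost_per_million_tokens(gpu_hour_price: float, gpus: int,
                            tokens_per_sec: float) -> float:
    """Serving cost per 1M generated tokens for one inference replica."""
    tokens_per_hour = tokens_per_sec * 3600
    return (gpu_hour_price * gpus) / tokens_per_hour * 1e6


if __name__ == "__main__":
    # Assumed: 70B-parameter model, 2T training tokens,
    # ~1e15 FLOP/s per accelerator, 40% utilization, 30-day training budget.
    n_gpus = training_gpu_count(params=70e9, tokens=2e12,
                                gpu_flops=1e15, utilization=0.4, days=30)
    print(f"Training: ~{n_gpus:,.0f} accelerators for a 30-day run")

    # Assumed: 8-accelerator inference replica at $2/GPU-hour,
    # 600 tokens/s aggregate generation throughput.
    cost = cost_per_million_tokens(gpu_hour_price=2.0, gpus=8,
                                   tokens_per_sec=600)
    print(f"Inference: ~${cost:.2f} per 1M generated tokens")
```

Varying the assumed throughput or fleet size in this sketch shows how QoE targets (tokens per second delivered to the end user) trade off against serving cost, which is the balance the paper examines.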

Authors (2)
  1. Haiwei Dong (26 papers)
  2. Shuang Xie (2 papers)
Citations (4)