Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI's LLM with Open Source SLMs in Production (2312.14972v3)

Published 20 Dec 2023 in cs.SE, cs.AI, and cs.LG

Abstract: Many companies use LLMs offered as a service, like OpenAI's GPT-4, to create AI-enabled product experiences. Along with the benefits of ease-of-use and shortened time-to-solution, this reliance on proprietary services has downsides in model control, performance reliability, uptime predictability, and cost. At the same time, a flurry of open-source small language models (SLMs) has been made available for commercial use. However, their readiness to replace existing capabilities remains unclear, and a systematic approach to holistically evaluate these SLMs is not readily available. This paper presents a systematic evaluation methodology and a characterization of modern open-source SLMs and their trade-offs when replacing proprietary LLMs for a real-world product feature. We have designed SLaM, an open-source automated analysis tool that enables the quantitative and qualitative testing of product features utilizing arbitrary SLMs. Using SLaM, we examine the quality and performance characteristics of modern SLMs relative to an existing customer-facing implementation using the OpenAI GPT-4 API. Across 9 SLMs and their 29 variants, we observe that SLMs provide competitive results, significant performance consistency improvements, and a cost reduction of 5x to 29x when compared to GPT-4.
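The abstract's quantitative comparison of SLM outputs against the existing GPT-4-based implementation can be pictured with a minimal sketch: embed the GPT-4 baseline response and an SLM candidate response, then score their semantic similarity. This is only one plausible scoring approach assumed for illustration (using the sentence-transformers library and the all-MiniLM-L6-v2 model, neither of which is specified in the abstract), not the actual SLaM implementation.

```python
# Minimal sketch (assumed approach, not the SLaM tool itself): score how close an
# SLM's response is to the GPT-4 baseline response via embedding cosine similarity.
# The embedding model name is an illustrative choice, not taken from the paper.
from sentence_transformers import SentenceTransformer, util


def similarity_to_baseline(baseline: str, candidate: str,
                           model_name: str = "all-MiniLM-L6-v2") -> float:
    """Return cosine similarity between a GPT-4 baseline response and an SLM response."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode([baseline, candidate], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()


if __name__ == "__main__":
    gpt4_response = "Summary: the team agreed on three action items for the next sprint."
    slm_response = "The meeting produced three agreed action items for the next sprint."
    print(f"similarity to GPT-4 baseline: {similarity_to_baseline(gpt4_response, slm_response):.3f}")
```

A harness in the spirit of SLaM could aggregate such scores across the 29 SLM variants alongside latency and cost measurements; the metrics the tool actually uses are defined in the paper itself.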

Authors (9)
  1. Chandra Irugalbandara (3 papers)
  2. Ashish Mahendra (4 papers)
  3. Roland Daynauth (6 papers)
  4. Tharuka Kasthuri Arachchige (1 paper)
  5. Krisztian Flautner (6 papers)
  6. Lingjia Tang (15 papers)
  7. Yiping Kang (8 papers)
  8. Jason Mars (21 papers)
  9. Jayanaka Dantanarayana (2 papers)
Citations (6)