The economic trade-offs of large language models: A case study (2306.07402v1)

Published 8 Jun 2023 in cs.CL and cs.AI

Abstract: Contacting customer service via chat is a common practice. Because employing customer service agents is expensive, many companies are turning to NLP models that assist human agents by auto-generating responses that can be used directly or with modifications. LLMs are a natural fit for this use case; however, their efficacy must be balanced with the cost of training and serving them. This paper assesses the practical cost and impact of LLMs for the enterprise as a function of the usefulness of the responses that they generate. We present a cost framework for evaluating an NLP model's utility for this use case and apply it to a single brand as a case study in the context of an existing agent assistance product. We compare three strategies for specializing an LLM - prompt engineering, fine-tuning, and knowledge distillation - using feedback from the brand's customer service agents. We find that the usability of a model's responses can make up for a large difference in inference cost for our case study brand, and we extrapolate our findings to the broader enterprise space.
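
The abstract's central trade-off (per-response serving cost versus how often agents can actually use a suggested response) can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's framework; the strategy names mirror the three specialization approaches it compares, but every cost, usability rate, and time-savings figure is a hypothetical placeholder.

```python
# Minimal sketch of a cost-vs-usability comparison (illustrative only; all
# numbers are assumptions, not figures from the paper or its case study).

AGENT_COST_PER_MINUTE = 0.50    # assumed fully loaded agent cost (USD/min)
MINUTES_SAVED_IF_USABLE = 1.0   # assumed handle time saved when a suggestion is used

strategies = {
    # strategy: (inference cost per response in USD, fraction of suggestions agents find usable)
    "prompt_engineering":     (0.0200, 0.55),
    "fine_tuning":            (0.0120, 0.65),
    "knowledge_distillation": (0.0008, 0.60),
}

def net_value_per_response(inference_cost: float, usability: float) -> float:
    """Expected agent-time savings minus serving cost for one generated response."""
    expected_savings = usability * MINUTES_SAVED_IF_USABLE * AGENT_COST_PER_MINUTE
    return expected_savings - inference_cost

for name, (cost, usability) in strategies.items():
    value = net_value_per_response(cost, usability)
    print(f"{name:>22}: net value per response = ${value:.4f}")
```

Under assumptions like these, a cheaper distilled model can come out ahead even with slightly lower usability, while a large gap in usability can outweigh a large gap in inference cost, which is the direction of the paper's finding.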
