The economic trade-offs of large language models: A case study (2306.07402v1)
Abstract: Contacting customer service via chat is common practice. Because employing customer service agents is expensive, many companies are turning to NLP models that assist human agents by auto-generating responses that agents can use directly or with modifications. LLMs are a natural fit for this use case; however, their efficacy must be balanced against the cost of training and serving them. This paper assesses the practical cost and impact of LLMs for the enterprise as a function of the usefulness of the responses they generate. We present a cost framework for evaluating an NLP model's utility for this use case and apply it to a single brand as a case study in the context of an existing agent-assistance product. We compare three strategies for specializing an LLM (prompt engineering, fine-tuning, and knowledge distillation) using feedback from the brand's customer service agents. We find that the usability of a model's responses can make up for a large difference in inference cost for our case study brand, and we extrapolate our findings to the broader enterprise space.
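The central trade-off the abstract describes, that the usability of a model's responses can offset a large difference in inference cost, can be made concrete with a cost-per-usable-response calculation. The sketch below is a hypothetical illustration rather than the paper's actual cost framework: the function name, parameters, and numbers are assumptions introduced only for the example.

```python
# Hypothetical illustration of the usability-vs-inference-cost trade-off.
# All names and figures below are assumptions, not values from the paper.

def cost_per_usable_response(inference_cost_per_response: float,
                             usable_fraction: float) -> float:
    """Effective cost of obtaining one response an agent can actually use."""
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return inference_cost_per_response / usable_fraction

# Model A: higher per-call inference cost, but agents use most of its responses.
model_a = cost_per_usable_response(inference_cost_per_response=0.010,
                                   usable_fraction=0.60)

# Model B: much cheaper per call, but few of its responses are usable as-is.
model_b = cost_per_usable_response(inference_cost_per_response=0.002,
                                   usable_fraction=0.10)

print(f"Model A: ${model_a:.4f} per usable response")  # $0.0167
print(f"Model B: ${model_b:.4f} per usable response")  # $0.0200
```

Under these made-up numbers, the model that is five times more expensive to serve is still cheaper per usable response, which is the kind of outcome the paper's case study quantifies with real agent feedback and serving costs.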