Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling (2308.06077v3)

Published 11 Aug 2023 in cs.CL

Abstract: Generative language models (LMs) have become omnipresent across data science. For a wide variety of tasks, inputs can be phrased as natural language prompts for an LM, from whose output the solution can then be extracted. LM performance has consistently been increasing with model size - but so has the monetary cost of querying the ever larger models. Importantly, however, not all inputs are equally hard: some require larger LMs for obtaining a satisfactory solution, whereas for others smaller LMs suffice. Based on this fact, we design a framework for cost-effective LM choice, called "Fly-swat or cannon" (FORC). Given a set of inputs and a set of candidate LMs, FORC judiciously assigns each input to an LM predicted to do well on the input according to a so-called meta-model, aiming to achieve high overall performance at low cost. The cost-performance tradeoff can be flexibly tuned by the user. Options include, among others, maximizing total expected performance (or the number of processed inputs) while staying within a given cost budget, or minimizing total cost while processing all inputs. We evaluate FORC on 14 datasets covering five natural language tasks, using four candidate LMs of vastly different size and cost. With FORC, we match the performance of the largest available LM while achieving a cost reduction of 63%. Via our publicly available library, researchers as well as practitioners can thus save large amounts of money without sacrificing performance.
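The abstract describes FORC's core idea: a meta-model predicts how well each candidate LM will perform on each input, and an assignment strategy then trades off predicted performance against query cost. The sketch below is a minimal illustration of that idea, not the FORC library's actual API: the candidate names, per-query costs, hard-coded "meta-model" predictions, the 5% tolerance, and the greedy budget check are all hypothetical stand-ins for the paper's optimization-based assignment strategies.

```python
# Illustrative sketch of meta-model-guided LM assignment (hypothetical; not the FORC library API).
# Assumption: a meta-model has already produced a predicted quality score for every
# (input, candidate LM) pair; here those predictions are hard-coded toy numbers.

from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cost_per_query: float  # hypothetical flat per-query cost in dollars


CANDIDATES = [
    Candidate("small-lm", 0.0004),
    Candidate("medium-lm", 0.002),
    Candidate("large-lm", 0.03),
]

# predicted_scores[i][j]: meta-model's predicted quality of candidate j on input i
predicted_scores = [
    [0.55, 0.80, 0.92],  # easy input: the small model is nearly as good
    [0.10, 0.35, 0.90],  # hard input: only the large model is predicted to succeed
    [0.70, 0.75, 0.78],
]


def assign_within_budget(scores, candidates, budget):
    """Greedy heuristic: for each input, pick the cheapest candidate whose predicted
    score is within a small tolerance of the best achievable score, then check the
    resulting total spend against the budget. This is one of several possible
    strategies; the paper formulates the cost-performance trade-off more generally."""
    assignment = []
    spent = 0.0
    for row in scores:
        best = max(row)
        # cheapest candidate within 5% of the best predicted score (tolerance is arbitrary)
        j = min(
            (k for k, s in enumerate(row) if s >= 0.95 * best),
            key=lambda k: candidates[k].cost_per_query,
        )
        assignment.append(j)
        spent += candidates[j].cost_per_query
    if spent > budget:
        raise ValueError(f"assignment exceeds the budget ({spent:.4f} > {budget})")
    return assignment, spent


if __name__ == "__main__":
    assignment, spent = assign_within_budget(predicted_scores, CANDIDATES, budget=0.10)
    for i, j in enumerate(assignment):
        print(f"input {i} -> {CANDIDATES[j].name} (predicted score {predicted_scores[i][j]:.2f})")
    print(f"total estimated cost: ${spent:.4f}")
```

In this toy run the hard input is routed to the large model while the easier third input falls back to the medium model, which is the qualitative behavior the abstract describes; the paper's other modes (e.g. minimizing cost while processing all inputs) would replace the greedy rule with a different objective.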
