Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings (2405.16820v1)

Published 27 May 2024 in cs.LG, cs.AI, cs.CY, and cs.HC

Abstract: The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty of using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We also assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance on domain-adapted tasks without sacrificing generality. We then run experiments considering practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of choosing reproducibility and transparency over for-profit, state-of-the-art zero-shot performance, finding this cost to be marginal under realistic settings.
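To make the single-GPU fine-tuning setting concrete, here is a minimal sketch of the kind of workflow the abstract describes: adapting a small open-weight model on commodity hardware. The abstract does not specify the paper's exact recipe, so this assumes a QLoRA-style setup (4-bit quantized base model plus LoRA adapters) with the Hugging Face transformers, peft, bitsandbytes, and datasets libraries; the model name, dataset path, and hyperparameters below are illustrative placeholders, not the authors' configuration.

```python
# Sketch: QLoRA-style fine-tuning of a ~7B open-weight model on one consumer GPU.
# Model, dataset file, and hyperparameters are illustrative assumptions only.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-v0.1"  # example open-weight model

# Quantize the base weights to 4-bit so the model fits in a single GPU's memory.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapters; the quantized base weights stay frozen.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# Hypothetical JSONL file whose "text" field holds prompt + target strings.
dataset = load_dataset("json", data_files="domain_task_train.jsonl", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(output_dir="adapter_out",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,  # emulate a larger batch on limited VRAM
                           num_train_epochs=3,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=10),
)
trainer.train()
model.save_pretrained("adapter_out")  # saves only the small adapter weights
```

The practical point of such a setup is the one the abstract makes: the trainable adapter is a tiny fraction of the model's parameters, so domain adaptation is cheap, the base model remains general, and everything runs locally without sending data to an API.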
