Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators (2307.05532v1)

Published 8 Jul 2023 in cs.CL

Abstract: LLMs that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of OpenAI's ChatGPT, a proprietary LLM for text generation fine-tuned through reinforcement learning from human feedback (LLM+RLHF). We review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare. Degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.

Authors (3)
  1. Andreas Liesenfeld
  2. Alianda Lopez
  3. Mark Dingemanse

Summary

Openness in Instruction-Tuned LLMs

The paper "Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators" authored by Andreas Liesenfeld, Alianda Lopez, and Mark Dingemanse presents a critical examination of the degrees of openness within the field of instruction-tuned LLMs, specifically focusing on alternatives to proprietary models like ChatGPT. This paper evaluates the dimensions of openness that pertain to code, data, model weights, RLHF data, licensing, and documentation. It provides a snapshot of the current landscape of open-source models that challenge proprietary constraints by emphasizing the importance of transparent, accountable, and reproducible research practices in artificial intelligence development.

Key Findings

The paper identifies two main categories of providers of instruction-tuned LLMs: smaller projects that build thin layers on top of existing LLMs and often inherit undocumented data and artifacts, and larger, organization-backed initiatives that aim to offer robust systems comparable to ChatGPT. The authors highlight several factors behind the sparsity of data and documentation: inheritance of undocumented data, non-sharing of the RLHF data crucial for instruction-tuning, and scarce peer-reviewed scientific documentation.

Despite these challenges, some initiatives have succeeded in promoting increased openness:

  1. Open Source Projects: Notable mentions in the paper include the BLOOMZ and mT0 models from the Hugging Face-backed BigScience workshop, and LAION-AI's OpenAssistant, which leverage open RLHF datasets for more extensive tuning and interoperability in text generation.
  2. Transparency and Documentation: These projects often provide considerable transparency, offering project documentation, codebases, and licensing models that support reproducible and collaborative AI development.
  3. Performance and Utility: The paper weighs the pros and cons of smaller-scale projects, which, while lacking comprehensive performance metrics compared to proprietary solutions, offer accessible entry points for understanding LLM+RLHF tools.

Implications

The implications of this research extend to both practical and theoretical dimensions of AI. Practically, the promotion of open data and documentation can stimulate reproducible research workflows and facilitate deeper understanding and evaluation of instruction-tuned models. Theoretically, openness and transparent practices help demystify the architectures and dynamics of LLM+RLHF systems, enabling scrutiny and fostering trust in AI systems. The advancement of open models could also counterbalance corporate advantages by democratizing access to cutting-edge AI technologies and data.

The findings emphasize the need for systematic incentives and regulatory frameworks to support open practices. Moreover, existing technological ecosystems should evolve to incorporate citizen-generated data and models, bridging gaps between research communities and public resources.

Future Prospects

Future research in this domain will likely continue to explore the gradient between open and closed release methods and to improve mechanisms for transparent legal and ethical documentation. Enhanced data curation practices will also play a significant role in fostering accountability. With rising calls for regulatory oversight and the democratization of AI capabilities, research collaborations and frameworks like those highlighted in this paper offer an ethical pathway toward robust, transparent, and equitable AI development.

In conclusion, the paper provides an insightful examination of the open-source landscape of instruction-tuned LLMs, highlighting both achievements and ongoing challenges in the quest for transparency and accountability in AI systems. Its advocacy for openness can serve as a counterbalance to proprietary dominance, encouraging practices aligned with scientific rigor and societal benefit.
