Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators (2307.05532v1)
Abstract: LLMs that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of OpenAI's ChatGPT, a proprietary LLM for text generation fine-tuned through reinforcement learning from human feedback (LLM+RLHF). We review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important instruction-tuning step (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare. Degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.
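The evaluation dimensions named in the abstract (code, training data, model weights, RLHF data, licensing, scientific documentation, access methods) could be recorded per project in a simple structure like the sketch below. The field names, the three-level open/partial/closed grades, and the toy aggregate score are illustrative assumptions, not the paper's actual rubric.

```python
from dataclasses import dataclass, fields

# Illustrative three-level grade (an assumption, mirroring common
# open / partially open / closed ratings in openness surveys).
OPEN, PARTIAL, CLOSED = "open", "partial", "closed"

@dataclass
class OpennessRecord:
    """Hypothetical per-project record of the surveyed openness dimensions."""
    project: str
    code: str
    training_data: str
    model_weights: str
    rlhf_data: str
    license: str
    scientific_documentation: str
    access_methods: str

def openness_score(record: OpennessRecord) -> float:
    """Toy aggregate: fraction of dimensions graded fully open (illustrative only)."""
    dims = [f.name for f in fields(record) if f.name != "project"]
    return sum(getattr(record, d) == OPEN for d in dims) / len(dims)

# A made-up example project: note RLHF data is the dimension
# the paper finds is most rarely shared.
example = OpennessRecord(
    project="hypothetical-llm",
    code=OPEN, training_data=PARTIAL, model_weights=OPEN,
    rlhf_data=CLOSED, license=OPEN,
    scientific_documentation=PARTIAL, access_methods=OPEN,
)
print(openness_score(example))  # 4 of 7 dimensions fully open
```

A per-dimension record like this makes the paper's central point concrete: "open source" is not a binary label, and a single project can be open on weights yet closed on the RLHF data that shaped its behaviour.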
- Andreas Liesenfeld
- Alianda Lopez
- Mark Dingemanse