- The paper shows that model activity on the Hugging Face Hub is highly skewed: over 70% of models have zero downloads, while 1% account for 99% of all downloads.
- The paper reveals that 87% of model repositories have a single contributor and 89% of developers work in isolation, highlighting limited collaborative engagement.
- The paper finds that repositories with permissive licenses show higher activity and engagement, a result relevant to open source policy and governance.
Open Source AI: A Deep Dive into the Hugging Face Hub
Development Activity in the HF Hub
Development activity on platforms like the Hugging Face (HF) Hub is a growing research topic in open source AI. The paper analyzes activity across 348,181 model repositories, 65,761 dataset repositories, and 156,642 spaces on the HF Hub.
- Right-Skewed Distributions: The HF Hub shows a very imbalanced activity distribution. For example, over 70% of models have zero downloads, while a mere 1% account for 99% of all downloads. Similarly right-skewed distributions are observed in likes, discussions, and commits, reflective of the Pareto principle.
- Community Engagement: Community size correlates with activity levels. However, most repositories have limited collaborative engagement, with around 87% of model repositories having only one contributor. High reciprocity values indicate prevalent mutual relationships among active developers.
- Licenses: Most repositories do not declare a license: 65% of model repositories and 72% of dataset repositories have none. Repositories with permissive licenses, however, exhibit higher activity levels and engagement.
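The two concentration figures quoted above (the share of models with zero downloads and the share of downloads captured by the top 1%) can be computed from a plain list of per-repository download counts. The following is a minimal sketch, not the paper's actual pipeline: the function names `zero_share` and `top_share` and the toy data are hypothetical, chosen only to illustrate the calculation.

```python
def zero_share(downloads):
    """Fraction of repositories with zero downloads."""
    return sum(1 for d in downloads if d == 0) / len(downloads)


def top_share(downloads, fraction=0.01):
    """Share of all downloads captured by the top `fraction` of repositories."""
    ranked = sorted(downloads, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # at least one repository
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical toy data (not from the paper): 100 repositories,
# 72 of them with no downloads and one very popular repository.
downloads = [0] * 72 + [1] * 20 + [10] * 7 + [9910]

print(f"zero-download share: {zero_share(downloads):.2f}")
print(f"top 1% share of downloads: {top_share(downloads):.3f}")
```

The same two functions apply unchanged to likes, discussions, or commits, since each is just a list of non-negative counts.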
Collaboration Patterns
Analysis of the social network structure of collaboration in model repositories reveals a core-periphery structure with a few prolific developers at the core:
- Isolate Developers: About 89% of developers work in isolation without collaboration.
- Core-Periphery Structure: Collaboration is characterized by a dense core of prolific developers. High reciprocity values suggest that collaborations are mutual and not dependent on one-sided efforts.
- High Modularity: The developer network splits into distinct communities around AI sub-fields such as NLP and computer vision. Modularity is high at lower k-core thresholds, which indicates loosely connected groups.
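Two of the network measures mentioned above, the share of isolated developers and reciprocity, are simple enough to sketch without a graph library. The example below assumes a directed edge `(u, v)` means that developer `u` contributed to a repository owned by `v`; that edge definition, the function names, and the toy data are assumptions made for illustration and may differ from how the paper builds its network. Reciprocity is computed here as the fraction of directed edges whose reverse edge also exists, one common definition.

```python
def isolate_share(developers, edges):
    """Fraction of developers with no collaboration edge (self-loops ignored)."""
    connected = set()
    for u, v in edges:
        if u != v:
            connected.update((u, v))
    return sum(1 for d in developers if d not in connected) / len(developers)


def reciprocity(edges):
    """Fraction of directed edges (u, v) for which (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Hypothetical toy network: ten developers, only three of whom collaborate.
developers = ["ana", "ben", "cho", "dev", "eli", "fay", "gus", "hal", "ivy", "jo"]
edges = [("ana", "ben"), ("ben", "ana"), ("ana", "cho")]

print(f"isolated developers: {isolate_share(developers, edges):.0%}")
print(f"reciprocity: {reciprocity(edges):.2f}")
```

Measures such as k-core decomposition and modularity are more involved and are usually delegated to a graph library rather than written by hand.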
Model Adoption in Spaces
Model adoption in spaces, an indicator of practical use and impact, also reveals a high concentration among a few key players:
- Right-Skewed Distribution: Similar to activity distributions, model adoption displays a right-skewed distribution where a minority of models see significant usage. For instance, runwayml/stable-diffusion-v1-5 is used in over 1,747 spaces.
- Dominance by Big Tech: Major players like Meta, Google, and Stability AI are behind the most adopted models, indicating a high concentration of influence. Models from industry leaders are not only the most used but also highly co-used with other models.
- Correlation with Likes: A strong positive correlation exists between the number of likes a model repository receives and its usage in spaces, highlighting likes as a good indicator of model adoption.
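The relationship between likes and usage in spaces can be checked with a rank correlation, which suits heavily skewed counts. The sketch below uses Spearman's coefficient as an assumption; the text above does not say which coefficient the paper reports. The helper names and the toy numbers are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank for the tie group i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: likes per model and number of spaces using it.
likes = [0, 2, 5, 40, 300, 9000]
spaces = [0, 0, 1, 3, 25, 1200]

rho = spearman(likes, spaces)
print(f"Spearman correlation: {rho:.3f}")
```

A coefficient close to 1 means that models with more likes tend to be used in more spaces, regardless of how extreme the largest values are.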
Implications and Future Research Directions
The findings underscore significant concentration and imbalance in development activity and model adoption within the HF Hub community. The main implications are:
- Licensing: Given the substantial number of unlicensed repositories, developers and platform providers should emphasize the importance of proper licensing to facilitate collaboration and avoid legal complications.
- Community Engagement: Encouraging more collaboration, especially among isolated developers, could lead to richer, more diverse contributions in open source AI.
- Policy and Governance: Policymakers should take note of these concentrated influences, especially by industry leaders, to inform discussions on the benefits, risks, and governance of open models.
- Methodological Considerations: Future research should expand beyond the HF Hub to include other platforms and compare development and collaboration patterns. Additionally, temporal analyses could provide more nuanced insights into the dynamics of open source AI development.
- Understanding Developer Incentives: Investigating what drives individual developers and companies to contribute to open source AI could illuminate paths to more balanced and democratic participation.
This paper not only extends the literature on open source AI but also provides practical insights for researchers, developers, policymakers, and platform providers. As open models continue to proliferate, a deeper understanding of these collaborative practices will be essential for fostering a sustainable and equitable AI ecosystem.