PeaTMOSS Dataset Analysis
- PeaTMOSS is a multi-modal dataset aggregating over 280,000 pre-trained model entries and mapping them to 28,575 GitHub projects.
- It employs LLM-driven pipelines for automatic metadata extraction, ensuring schema adherence across model documentation.
- The dataset enables empirical studies on PTM supply chains, dependency networks, documentation quality, and licensing compliance.
PeaTMOSS (Pre-Trained Models in Open-Source Software) is a large-scale, multi-modal dataset designed to enable empirical studies of the supply chain of pre-trained deep learning models (PTMs) and their integration into open-source software (OSS). PeaTMOSS compiles and cross-references rich metadata on over 280,000 PTMs sourced primarily from Hugging Face and PyTorch Hub and includes detailed mappings to tens of thousands of GitHub repositories that reuse these models. The dataset is uniquely positioned to support quantitative and qualitative analyses of PTM development, maintenance, documentation, licensing, and downstream usage, addressing critical gaps in understanding the propagation and evolution of learned artifacts within modern software ecosystems.
1. Dataset Structure and Coverage
PeaTMOSS consists of several interrelated components, making it one of the most comprehensive resources for studying PTM reuse in OSS:
- PTM Registry Snapshots: PeaTMOSS provides snapshots of 281,276 PTM packages from major registries, capturing model names, versions, frameworks, and registry-derived metadata.
- Open-Source Project Collection: The dataset catalogs 28,575 GitHub projects that utilize PTMs, along with information on their commits, issues, pull requests, and repository histories.
- PTM–Project Mappings: A curated table enumerates 44,337 explicit mappings between 2,530 distinct PTMs and 15,129 downstream GitHub repositories that depend on them.
- Data Formats: The dataset is released in two principal versions: a metadata-focused SQLite database and a full release with complete git repository histories, available via Globus.
- Metadata Schema: The schema supports long-term, cross-modal joinability, capturing both tabular and unstructured data from model cards and code repositories.
This structure enables longitudinal, cross-sectional, and network-oriented analyses of PTM adoption and propagation throughout the OSS landscape (Jiang et al., 1 Feb 2024, Yasmin et al., 7 Sep 2025).
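The metadata-focused SQLite release supports exactly this kind of cross-component query. The sketch below builds a tiny in-memory stand-in and computes downstream reuse counts per PTM; the table and column names (`ptm`, `repo`, `ptm_repo_map`) are illustrative placeholders, not the actual PeaTMOSS schema.

```python
import sqlite3

# Tiny in-memory stand-in for the PeaTMOSS metadata database.
# Table/column names are illustrative, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ptm (id INTEGER PRIMARY KEY, name TEXT, framework TEXT);
CREATE TABLE repo (id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE ptm_repo_map (ptm_id INTEGER, repo_id INTEGER);
INSERT INTO ptm VALUES (1, 'bert-base-uncased', 'PyTorch'),
                       (2, 'resnet50', 'PyTorch');
INSERT INTO repo VALUES (10, 'acme/chatbot'), (11, 'acme/vision-app');
INSERT INTO ptm_repo_map VALUES (1, 10), (2, 11), (1, 11);
""")

# Count downstream repositories per PTM -- the kind of supply-chain
# question the PTM-project mapping table is designed to answer.
rows = conn.execute("""
    SELECT p.name, COUNT(m.repo_id) AS n_repos
    FROM ptm p JOIN ptm_repo_map m ON m.ptm_id = p.id
    GROUP BY p.name ORDER BY n_repos DESC
""").fetchall()
print(rows)
```

Against the full mapping table (44,337 rows), the same `GROUP BY` shape yields the reuse-degree distribution across all 2,530 mapped PTMs.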
2. Automated Metadata Extraction and Enrichment
The unstandardized nature of PTM documentation (primarily model cards written in freeform Markdown) posed an initial challenge. PeaTMOSS addresses this via LLM-driven automatic extraction pipelines:
- LLM Pipelines: Two approaches were developed. The cost-efficient pipeline uses GPT-3.5 combined with a Retrieval-Augmented Generation (RAG) framework to reduce token input cost, while the more accurate pipeline leverages GPT-4-turbo and processes full model cards without RAG.
- Schema Adherence: Iterative prompt refinement ensures all extracted metadata complies with a predetermined JSON schema capturing fields such as architecture, number of parameters, training datasets, evaluation metrics, hyperparameters, license, base model, limitations, biases, and carbon emissions.
- Coverage and Completeness: Quantitative summary statistics show that while the majority of PTMs provide core fields (e.g., domain, library, and tasks, with coverage as high as 98.9%), fewer than half include input/output format, hyperparameters, limitations, sponsorship, or carbon estimates.
Centralized, queryable metadata enables researchers to analyze PTMs using statistical representations, e.g., computing proportions of missing values and measuring documentation completeness (Jiang et al., 1 Feb 2024).
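A minimal sketch of the schema-adherence step: after extraction, each record is checked against the expected fields before it enters the database. The field names mirror those listed above, but the validation logic and the `REQUIRED` table are illustrative assumptions, not the pipeline's actual implementation.

```python
# Post-extraction schema check: flag required metadata fields that an
# LLM-extracted record is missing or has filled with the wrong type.
# Field names follow the schema described in the text; the checker
# itself is a simplified illustration.
REQUIRED = {"architecture": str, "parameter_count": int,
            "training_datasets": list, "license": str}

def missing_or_mistyped(record: dict) -> list[str]:
    """Return names of required fields that are absent or wrongly typed."""
    bad = []
    for field, typ in REQUIRED.items():
        if field not in record or not isinstance(record[field], typ):
            bad.append(field)
    return bad

extracted = {"architecture": "BERT", "parameter_count": 110_000_000,
             "training_datasets": ["BookCorpus", "Wikipedia"]}
print(missing_or_mistyped(extracted))  # ['license']
```

Iterative prompt refinement then targets exactly these flagged fields until the extracted JSON passes cleanly.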
3. Trends, Documentation Quality, and Domain Analysis
Systematic analysis of PeaTMOSS reveals domain shifts and patterns in PTM growth and documentation:
- Domain Distribution: NLP accounts for approximately 60.3% of Hugging Face PTMs; in contrast, PyTorch Hub is dominated by Computer Vision (CV) and audio models.
- Temporal Dynamics: Since August 2022, the domain mix at major hubs has broadened, with increasing representation of multimodal PTMs.
- Model Scaling: The median number of parameters in NLP and multimodal models has increased markedly in recent years, while Audio and CV model sizes remain relatively stable.
- Documentation Shortcomings: Core metadata fields are generally well-covered, but significant gaps persist for advanced details (e.g., hyperparameters, limitations, biases), substantiating the need for improved standardization in model cards.
Overall, the dataset illustrates both the rapid expansion and the persistent idiosyncrasy of PTM documentation and downstream supply chain metadata.
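The completeness statistic behind these observations is simply the fraction of records whose model card fills a given field. A sketch over invented toy records:

```python
# Documentation-completeness sketch: proportion of PTM records with a
# non-empty value for each metadata field. The records are invented
# for illustration.
records = [
    {"domain": "NLP", "hyperparameters": None, "limitations": "see card"},
    {"domain": "CV",  "hyperparameters": {"lr": 1e-4}, "limitations": None},
    {"domain": "NLP", "hyperparameters": None, "limitations": None},
]

def coverage(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, "", []))
    return filled / len(records)

for f in ("domain", "hyperparameters", "limitations"):
    print(f, round(coverage(records, f), 2))
```

Run over the full registry snapshot, this is how core fields show near-complete coverage while advanced fields (hyperparameters, limitations, biases) fall below 50%.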
4. Software Dependency Integration and Pipeline Taxonomy
PeaTMOSS underpins studies of Software Dependencies 2.0, describing how PTMs introduce new forms of learned, rather than code-centric, dependencies in software projects (Yasmin et al., 7 Sep 2025):
- Integration Patterns: Empirical analysis of 401 representative GitHub repositories reveals developers predominantly declare PTM dependencies in source code (59%), with only 21.2% providing centralized external documentation (README/configuration files) and just 12% specifying PTM versioning.
- Pipeline Taxonomy: Ten canonical stages in the PTM reuse pipeline are identified: Model Initialization, Model Adaptation, Data Processing, Prompt Generation, Optional Feature Engineering, Fine-Tuning, Inference, Optional Post-Processing, Evaluation, and Delivery.
- Organizational Architectures: Pipelines cluster into feature extraction–oriented, generative (with prompt generation), and discriminative types—each with implications for reproducibility, maintainability, and technical debt. A typical pipeline spans a mean of 3.9 dedicated source files and 886 lines of code (median: 2 files, 361 LOC), evidencing the nontrivial nature of PTM reuse.
- Adaptation Practices: Strategies include using PTMs as-is, augmenting architectures with new heads or adapters, and customizing configuration for domain-specific specialization.
The non-uniform, distributed manner of dependency declaration and adaptation introduces maintenance and reproducibility challenges not addressed by existing dependency management approaches.
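Because most projects declare PTM dependencies directly in source code rather than in a manifest, recovering them requires scanning for loading idioms. The sketch below matches the common Hugging Face `from_pretrained(...)` call; it is a simplified heuristic offered for illustration, not the study's actual extraction tooling.

```python
import re

# Heuristic scanner for in-code PTM dependency declarations: find
# model identifiers passed as string literals to from_pretrained().
# A real scanner would also handle local paths, variables, and other
# loading APIs (e.g., torch.hub.load).
PTM_CALL = re.compile(r"""\.from_pretrained\(\s*['"]([^'"]+)['"]""")

def find_ptm_ids(source: str) -> list[str]:
    """Return PTM identifiers loaded via from_pretrained in a source file."""
    return PTM_CALL.findall(source)

snippet = '''
model = AutoModel.from_pretrained("bert-base-uncased")
tok = AutoTokenizer.from_pretrained('bert-base-uncased')
'''
print(find_ptm_ids(snippet))  # ['bert-base-uncased', 'bert-base-uncased']
```

Note what the heuristic cannot see: a pinned revision or version, which is consistent with only 12% of projects specifying PTM versioning at all.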
5. Multi-PTM Interactions and Dependency Networks
A salient feature exposed by PeaTMOSS is the prevalence and complexity of multi-PTM reuse within single projects:
- Prevalence: Approximately 52.6% of analyzed projects use multiple PTMs, with 37% demonstrating interchangeable (same-family) dependencies and 23% employing complementary (distinct role/modality) relationships.
- Interaction Taxonomy: Four primary types are observed across pipeline stages:
- Feature Handoff (42%): Output of one PTM (e.g., embeddings) serves as input to another.
- Feedback Guidance (45%): One PTM provides evaluative or supervisory signals for another (e.g., CLIP similarity during generative training).
- Evaluation (9%): PTMs compute metrics or diagnostics post hoc.
- Post-Processing Refinement (4%): PTMs validate/refine outputs of another in post-processing.
- Network Implications: Multi-model interactions—especially feedback loops and hierarchical handoff patterns—substantially increase architectural complexity, propagating downstream maintenance burdens and coupling changes across components.
This infrastructure supports graph-structured analyses of PTM supply chain centrality, dependency networks, and propagation of model updates.
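A minimal sketch of such a graph-structured view: each interaction becomes a typed edge between PTMs, and in-degree exposes the coupling a single model accumulates. The model names and edges below are invented for illustration, using the interaction types from the taxonomy above.

```python
from collections import defaultdict

# Toy multi-PTM interaction graph within one project. Edge labels use
# the interaction taxonomy from the text; models/edges are invented.
edges = [
    ("text-encoder", "diffusion-model", "feature_handoff"),    # embeddings in
    ("clip", "diffusion-model", "feedback_guidance"),          # training signal
    ("aesthetic-scorer", "diffusion-model", "evaluation"),     # post-hoc metric
]

incoming = defaultdict(list)
for src, dst, kind in edges:
    incoming[dst].append((src, kind))

# A PTM with many incoming edges couples several upstream models:
# an update to any of them can propagate changes here.
for dst, deps in incoming.items():
    print(dst, "depends on", len(deps), "other PTMs:", deps)
```

Scaled to the full mapping table, the same adjacency structure supports centrality measures and update-propagation analyses over the PTM supply chain.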
6. Licensing, Compliance, and Supply Chain Risks
Licensing heterogeneity and coordination challenges are notable supply chain issues illuminated by the PeaTMOSS dataset:
- License Mapping: Empirical mapping using PeaTMOSS reveals that while most PTM–downstream project pairs have compatible licenses, approximately 24% of mappings (a proportion of 0.24) exhibit conflicts, primarily resulting from copyleft–permissive license mismatches.
- License Coverage: Nearly 43% of downstream repositories lack any explicit license, complicating legal clarity for end users and maintainers.
- Case Applications: Sankey diagram visualizations derived from PeaTMOSS trace license flows and identify risk regions. This enables quantitative analysis of compliance and highlights the need for more rigorous license propagation practices.
The dataset offers a basis for future research exploring the legal and organizational risks in PTM-centered software supply chains.
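The core compatibility check can be sketched as a small rule table: a copyleft upstream license cannot flow into a permissively licensed downstream work, and an absent downstream license is its own risk category. The rules below are a coarse illustration, not legal advice or the paper's exact methodology.

```python
# Simplified PTM-to-project license compatibility check. The rule set
# is deliberately coarse: real compatibility analysis needs the full
# SPDX license graph and legal review.
COPYLEFT = {"GPL-3.0", "AGPL-3.0"}
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}

def check(ptm_license: str, repo_license):
    if repo_license is None:
        return "unlicensed"   # ~43% of downstream repos in the study
    if ptm_license in COPYLEFT and repo_license in PERMISSIVE:
        return "conflict"     # copyleft-permissive mismatch
    return "ok"

print(check("GPL-3.0", "MIT"))       # conflict
print(check("Apache-2.0", "MIT"))    # ok
print(check("MIT", None))            # unlicensed
```

Applying such a check across every PTM–repository mapping is what produces the conflict and no-license proportions reported above, and is the kind of logic a community compliance tool could automate.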
7. Research Directions and Tooling Opportunities
PeaTMOSS enables a range of future directions in empirical software engineering and provides building blocks for practical tool development:
- Supply Chain Mining: Quantitative modeling of evolutionary trajectories, fine-tuning chains ("phylogeny"), and PTM popularity predictors.
- Quality Correlation Studies: Analysis of potential links between PTM metadata quality, documentation completeness, and downstream project success.
- Coordination and Documentation: Investigation of the co-evolution between PTM versioning schemes and OSS project lifecycles.
- Search and Selection Tools: Development of engineering platforms for PTM discovery, evaluation, and comparison based on structured metadata.
- Dependency Management Frameworks: Opportunities for community tooling to manage "semantic edges" of PTM dependencies and automate license compliance checking.
A plausible implication is that the advent of PeaTMOSS marks a shift toward data-driven, supply-chain–aware maintenance and engineering of PTM-rich software, supporting both MSR researchers and practitioners in managing increasingly complex machine learning–centric OSS ecosystems (Jiang et al., 1 Feb 2024, Yasmin et al., 7 Sep 2025).