MCPCorpus: Dataset for MCP Ecosystem Analysis
- MCPCorpus is a comprehensive, normalized dataset of MCP servers and clients, aggregating details from MCP.so and GitHub for reproducible ecosystem studies.
- It features over 20 standardized attributes per artifact, ensuring consistent data integration and enabling quantitative analysis of code quality and engagement.
- The integrated tooling suite automates data synchronization, normalization, and inspection, facilitating robust research on adoption patterns and ecosystem health.
MCPCorpus is a large-scale, structured dataset and associated tooling suite for reproducible analysis of the Model Context Protocol (MCP) ecosystem. It provides a comprehensive, regularly updated snapshot of MCP servers and clients, including detailed technical and development metadata. By systematically aggregating, normalizing, and annotating artifacts from the public MCP registry (MCP.so) and GitHub, MCPCorpus enables empirical study of adoption patterns, operational diversity, and ecosystem health in the rapidly evolving field of model-tool integrations.
1. Dataset Composition and Scope
MCPCorpus aggregates and normalizes artifacts from the MCP ecosystem, with a particular focus on bridging registry-level and repository-level metadata:
- Artifact Coverage: The dataset encompasses approximately 14,000 MCP servers and 300 MCP clients. Each entry corresponds to either a distinct MCP server or client, as defined by the MCP.so registry and its associated GitHub repository.
- Annotation Format: Every artifact is represented as a JSON object following a unified schema. This schema combines registry information (e.g., artifact identity, category) with GitHub-derived measures of code quality and engagement.
- Data Provenance: The core of the dataset is constructed by joining the MCP registry (MCP.so) with code and project metadata harvested from GitHub. This dual-source approach enables direct linkage of interface-level configuration to concrete development activity.
- Reproducibility: All artifacts are maintained in a normalized format, with systematic deduplication and canonicalization (e.g., deduplicating by normalized GitHub URLs) to ensure results are robust to changes in upstream sources.
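To make the unified schema concrete, the sketch below shows what a single artifact record might look like as a Python dict. The field names come from the attribute schema described in this document; the values and the `is_server` helper are illustrative assumptions, not taken from the actual corpus.

```python
# Hypothetical sketch of one MCPCorpus artifact record; field names follow
# the documented schema, but all values here are made up for illustration.
artifact = {
    # Basic information (registry-derived, from MCP.so)
    "id": "example-server",
    "name": "example-server",
    "title": "Example MCP Server",
    "description": "Exposes a toy search tool over MCP.",
    "author_name": "example-author",
    "url": "https://github.com/example-author/example-server",
    "tags": ["search", "demo"],
    "type": "server",
    # GitHub signals and metadata (repository-derived)
    "stargazers_count": 128,
    "forks_count": 17,
    "open_issues_count": 4,
    "language": "Python",
    "license": "MIT",
    "archived": False,
    "has_dockerfile": True,
    "has_readme": True,
}

def is_server(record: dict) -> bool:
    """Distinguish servers from clients via the normalized `type` field."""
    return record.get("type") == "server"
```

Because every artifact follows the same flat schema, downstream tools can filter and join records without per-source special cases.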
2. Attribute Schema and Metadata Normalization
MCPCorpus’s annotation framework consists of over 20 normalized attributes per artifact, organized into several logically distinct categories:
| Field Category | Example Attributes | Functionality |
|---|---|---|
| Basic Information | id, name, title, description, author_name, url, tags, type | Artifact identity, brief summary, domain classification |
| Interface/Config | tools, sse_url, server_command, server_config | Exposed MCP tools/APIs, runtime invocation details |
| GitHub Signals | stargazers_count, forks_count, open_issues_count, contributors_count, last_commit | Quantitative development activity and community health |
| GitHub Metadata | full_name, language, languages, license, archived | Codebase structure, licensing, maintenance status |
| | has_dockerfile, has_readme, has_requirements | Implementation quality markers |
This multi-faceted schema provides an integrated view of both runtime characteristics (e.g., interface exposure, deployment commands) and the collaborative, longitudinal attributes of software development. The normalization layer parses heterogeneous registry content and repository layouts into schema-conformant fields, with logic to flatten JSON blobs and to track language breakdown at the byte level.
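The normalization steps mentioned above (URL canonicalization and byte-level language breakdowns) can be sketched as two small helpers. These are hypothetical implementations written for illustration; MCPCorpus's actual normalization logic may differ in its rules.

```python
from urllib.parse import urlparse

def normalize_github_url(url: str) -> str:
    """Canonicalize a GitHub URL for deduplication: lowercase the path,
    strip trailing slashes and a trailing .git suffix (illustrative rules)."""
    parsed = urlparse(url.strip())
    path = parsed.path.rstrip("/").removesuffix(".git").lower()
    return f"https://github.com{path}"

def language_shares(languages_bytes: dict[str, int]) -> dict[str, float]:
    """Convert GitHub's byte-level language breakdown into fractional shares,
    so language diversity can be compared across repositories of any size."""
    total = sum(languages_bytes.values()) or 1  # guard against empty repos
    return {lang: n / total for lang, n in languages_bytes.items()}
```

Canonicalizing URLs before comparison is what makes deduplication robust to cosmetic differences (casing, trailing slashes, clone suffixes) across upstream sources.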
3. Utility Tools for Synchronization, Normalization, and Inspection
To address the rapid pace of ecosystem change and to support reproducibility, MCPCorpus includes a utility toolkit that enables automation and consistent governance of the dataset:
- Data Synchronization: Scripts routinely poll MCP.so and GitHub APIs to fetch new or updated artifacts, preserving data currency as projects are created, renamed, forked, or archived.
- Metadata Normalization: Automated routines canonicalize registry and repository information to a flat, uniform schema. This includes conversion of language breakdowns, normalization of URLs, and the extraction of standardized signals from README or manifest files.
- Deduplication and Inspection: Utilities identify duplicate or stale entries by hashing, canonicalizing project identities, and filtering by consistent criteria such as last commit dates.
- Export Format: The dataset is output as newline-delimited JSON (JSONL), compatible with standard research tooling for data science and large-scale LLM integration pipelines.
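The deduplication and JSONL export steps above can be sketched as follows. Both functions are hypothetical helpers written under the assumptions stated in the comments, not the corpus's actual pipeline code.

```python
import json

def dedupe_by_url(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each canonicalized GitHub URL
    (illustrative key: lowercased URL with trailing slashes stripped)."""
    seen, out = set(), []
    for rec in records:
        key = rec.get("url", "").rstrip("/").lower()
        if key and key not in seen:
            seen.add(key)
            out.append(rec)
    return out

def export_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line (the JSONL export format)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

JSONL keeps each artifact independently parseable, so the corpus can be streamed line by line into pandas, jq, or an LLM ingestion pipeline without loading the whole file.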
4. Web-Based Search and Exploration Interface
MCPCorpus ships with a web-based search frontend tailored for efficient inspection and querying:
- Attribute-Based Filtering: The interface accepts queries using any of the normalized attributes, such as filtering by implementation language, popularity (star count), interface tags, or maintenance status.
- Category Navigation: Artifacts are discoverable by category (as assigned in MCP.so), enabling targeted studies, for example, of all servers exposing a particular toolset or APIs marked as belonging to a specific domain.
- Interface Inspection: Users may drill down into detailed metadata for any artifact (e.g., to view the exact set of tool endpoints exposed, Dockerization status, code activity metrics).
- Application: This interface is designed to facilitate both qualitative and quantitative ecosystem research, such as identifying patterns in deployment, code diversity, or project longevity.
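The attribute-based queries the frontend supports can be approximated over the raw JSONL with a few lines of Python. This is a sketch of the query semantics only; the field names follow the documented schema, while the function itself is an assumption.

```python
def filter_artifacts(records, language=None, min_stars=0, tag=None):
    """Yield records matching simple attribute filters, mimicking the kind
    of query the web interface supports (language, popularity, tags)."""
    for rec in records:
        if language and rec.get("language") != language:
            continue
        if rec.get("stargazers_count", 0) < min_stars:
            continue
        if tag and tag not in rec.get("tags", []):
            continue
        yield rec
```

For example, `filter_artifacts(records, language="Python", min_stars=100)` would select popular Python implementations for a targeted study.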
5. Research Applications and Implications
The MCPCorpus dataset and associated infrastructure support a range of advanced ecosystem analyses:
- Ecosystem Adoption Analysis: By examining attributes such as creation timestamps, star/commit time series, and contributor metrics, researchers can reconstruct the evolutionary trajectory of MCP adoption and detect emergent trends.
- Security Audit and Code Quality: Detailed interface and configuration metadata (e.g., presence of Dockerfiles, frequency of updates, README completeness) provide empirical signals for automated analysis of vulnerability exposure and secure-by-design patterns. MCPCorpus is positioned as a benchmark for security posture studies within the MCP landscape.
- Implementation Diversity and Interoperability: The detailed annotation of programming languages, dependencies, and interface configuration enables studies on cross-compatibility and technical uniformity/divergence within the MCP ecosystem.
- Longitudinal Studies: Regularly updated, normalized metadata enable longitudinal tracking of project lifecycles, repository health, license variations, and organizational adoption flows.
- Empirical Benchmarking: MCPCorpus supplies a standardized ground truth for data-driven benchmarking across tool-augmented LLM agent infrastructures.
A plausible implication is that the structured and extensible nature of the dataset facilitates reproducible measurement of ecosystem health and implementation diversity, enabling quantified assessment of MCP-based integrations over time.
6. Dataset Structure and Mathematical Considerations
While the MCPCorpus paper does not explicitly provide new mathematical models or formulas for dataset analysis, it employs LaTeX only for schema tables, attribute tabulation, and figure rendering. Quantitative analysis, such as statistical correlation of star counts, contributor activity, or interface uniformity, can be conducted externally using the provided attributes:
- No custom mathematical metric or formalism is presented in the underlying paper; all mathematical modeling is left to downstream users of the dataset.
- The artifact annotation schema is designed to support such modeling by providing normalized, consistent variable types, including integer signals (star/fork count), timestamp records, and enumerated types (categories, tags).
7. Significance for the MCP Tooling Research Community
MCPCorpus provides a reproducible, evolving, and richly annotated foundation for empirical research into the MCP ecosystem, addressing the challenge of fragmentation in the tool-augmented LLM landscape. By standardizing data representation and providing scalable infrastructure for synchronization, normalization, and search, it underpins comparative, longitudinal, and code-level analyses.
Researchers using MCPCorpus gain a robust reference set to measure adoption trends, study implementation variation, and benchmark agent interoperability, while practitioners are empowered to perform practical vulnerability assessments and integration feasibility studies across the expanding universe of MCP artifacts.
MCPCorpus is publicly available at https://github.com/Snakinya/MCPCorpus, with ongoing maintenance infrastructure to accommodate and mirror the rapid evolution of the protocol and its applications in production environments (Lin et al., 30 Jun 2025).