MCPCorpus: Dataset for MCP Ecosystem Analysis
- MCPCorpus is a comprehensive, normalized dataset of MCP servers and clients, aggregating details from MCP.so and GitHub for reproducible ecosystem studies.
- It features over 20 standardized attributes per artifact, ensuring consistent data integration and enabling quantitative analysis of code quality and engagement.
- The integrated tooling suite automates data synchronization, normalization, and inspection, facilitating robust research on adoption patterns and ecosystem health.
MCPCorpus is a large-scale, structured dataset and associated tooling suite for reproducible analysis of the Model Context Protocol (MCP) ecosystem. It provides a comprehensive, regularly updated snapshot of MCP servers and clients, including detailed technical and development metadata. By systematically aggregating, normalizing, and annotating artifacts from the public MCP registry (MCP.so) and GitHub, MCPCorpus enables empirical study of adoption patterns, operational diversity, and ecosystem health in the rapidly evolving field of model-tool integrations.
1. Dataset Composition and Scope
MCPCorpus aggregates and normalizes artifacts from the MCP ecosystem, with a particular focus on bridging registry-level and repository-level metadata:
- Artifact Coverage: The dataset encompasses approximately 14,000 MCP servers and 300 MCP clients. Each entry corresponds to either a distinct MCP server or client, as defined by the MCP.so registry and its associated GitHub repository.
- Annotation Format: Every artifact is represented as a JSON object following a unified schema. This schema combines registry information (e.g., artifact identity, category) with GitHub-derived measures of code quality and engagement.
- Data Provenance: The core of the dataset is constructed by joining the MCP registry (MCP.so) with code and project metadata harvested from GitHub. This dual-source approach enables direct linkage of interface-level configuration to concrete development activity.
- Reproducibility: All artifacts are maintained in a normalized format, with systematic deduplication and canonicalization (e.g., deduplicating by normalized GitHub URLs) to ensure results are robust to changes in upstream sources.
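To make the unified schema concrete, the sketch below shows what a single artifact record might look like as a Python dict. The field names come from the attribute schema described in this document; the values and the `is_server` helper are illustrative assumptions, not taken from the actual corpus.

```python
# Hypothetical sketch of one MCPCorpus artifact record; field names follow
# the documented schema, but all values here are made up for illustration.
artifact = {
    # Basic information (registry-derived, from MCP.so)
    "id": "example-server",
    "name": "example-server",
    "title": "Example MCP Server",
    "description": "Exposes a toy search tool over MCP.",
    "author_name": "example-author",
    "url": "https://github.com/example-author/example-server",
    "tags": ["search", "demo"],
    "type": "server",
    # GitHub signals and metadata (repository-derived)
    "stargazers_count": 128,
    "forks_count": 17,
    "open_issues_count": 4,
    "language": "Python",
    "license": "MIT",
    "archived": False,
    "has_dockerfile": True,
    "has_readme": True,
}

def is_server(record: dict) -> bool:
    """Distinguish servers from clients via the normalized `type` field."""
    return record.get("type") == "server"
```

Because every artifact follows the same flat schema, downstream tools can filter and join records without per-source special cases.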
2. Attribute Schema and Metadata Normalization
MCPCorpus’s annotation framework consists of over 20 normalized attributes per artifact, organized into several logically distinct categories:
| Field Category | Example Attributes | Functionality |
|---|---|---|
| Basic Information | id, name, title, description, author_name, url, tags, type | Artifact identity, brief summary, domain classification |
| Interface/Config | tools, sse_url, server_command, server_config | Exposed MCP tools/APIs, runtime invocation details |
| GitHub Signals | stargazers_count, forks_count, open_issues_count, contributors_count, last_commit | Quantitative development activity and community health |
| GitHub Metadata | full_name, language, languages, license, archived | Codebase structure, licensing, maintenance status |
| | has_dockerfile, has_readme, has_requirements | Implementation quality markers |
This multi-faceted schema provides an integrated view of both runtime characteristics (e.g., interface exposure, deployment commands) and the collaborative, longitudinal attributes of software development. The normalization layer parses heterogeneous registry content and repository layouts into schema-conformant fields, with logic to flatten JSON blobs and to track language breakdown at the byte level.
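The normalization steps mentioned above (URL canonicalization and byte-level language breakdowns) can be sketched as two small helpers. These are hypothetical implementations written for illustration; MCPCorpus's actual normalization logic may differ in its rules.

```python
from urllib.parse import urlparse

def normalize_github_url(url: str) -> str:
    """Canonicalize a GitHub URL for deduplication: lowercase the path,
    strip trailing slashes and a trailing .git suffix (illustrative rules)."""
    parsed = urlparse(url.strip())
    path = parsed.path.rstrip("/").removesuffix(".git").lower()
    return f"https://github.com{path}"

def language_shares(languages_bytes: dict[str, int]) -> dict[str, float]:
    """Convert GitHub's byte-level language breakdown into fractional shares,
    so language diversity can be compared across repositories of any size."""
    total = sum(languages_bytes.values()) or 1  # guard against empty repos
    return {lang: n / total for lang, n in languages_bytes.items()}
```

Canonicalizing URLs before comparison is what makes deduplication robust to cosmetic differences (casing, trailing slashes, clone suffixes) across upstream sources.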
3. Utility Tools for Synchronization, Normalization, and Inspection
To address the rapid pace of ecosystem change and to support reproducibility, MCPCorpus includes a utility toolkit that enables automation and consistent governance of the dataset:
- Data Synchronization: Scripts routinely poll MCP.so and GitHub APIs to fetch new or updated artifacts, preserving data currency as projects are created, renamed, forked, or archived.
- Metadata Normalization: Automated routines canonicalize registry and repository information to a flat, uniform schema. This includes conversion of language breakdowns, normalization of URLs, and the extraction of standardized signals from README or manifest files.
- Deduplication and Inspection: Utilities identify duplicate or stale entries by hashing, canonicalizing project identities, and filtering by consistent criteria such as last commit dates.
- Export Format: The dataset is output as newline-delimited JSON (JSONL), compatible with standard research tooling for data science and large-scale LLM integration pipelines.
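The deduplication and JSONL export steps above can be sketched as follows. Both functions are hypothetical helpers written under the assumptions stated in the comments, not the corpus's actual pipeline code.

```python
import json

def dedupe_by_url(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each canonicalized GitHub URL
    (illustrative key: lowercased URL with trailing slashes stripped)."""
    seen, out = set(), []
    for rec in records:
        key = rec.get("url", "").rstrip("/").lower()
        if key and key not in seen:
            seen.add(key)
            out.append(rec)
    return out

def export_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line (the JSONL export format)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

JSONL keeps each artifact independently parseable, so the corpus can be streamed line by line into pandas, jq, or an LLM ingestion pipeline without loading the whole file.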
4. Web-Based Search and Exploration Interface
MCPCorpus ships with a web-based search frontend tailored for efficient inspection and querying:
- Attribute-Based Filtering: The interface accepts queries using any of the normalized attributes, such as filtering by implementation language, popularity (star count), interface tags, or maintenance status.
- Category Navigation: Artifacts are discoverable by category (as assigned in MCP.so), enabling targeted studies, for example, of all servers exposing a particular toolset or APIs marked as belonging to a specific domain.
- Interface Inspection: Users may drill down into detailed metadata for any artifact (e.g., to view the exact set of tool endpoints exposed, Dockerization status, code activity metrics).
- Application: This interface is designed to facilitate both qualitative and quantitative ecosystem research, such as identifying patterns in deployment, code diversity, or project longevity.
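The attribute-based queries the frontend supports can be approximated over the raw JSONL with a few lines of Python. This is a sketch of the query semantics only; the field names follow the documented schema, while the function itself is an assumption.

```python
def filter_artifacts(records, language=None, min_stars=0, tag=None):
    """Yield records matching simple attribute filters, mimicking the kind
    of query the web interface supports (language, popularity, tags)."""
    for rec in records:
        if language and rec.get("language") != language:
            continue
        if rec.get("stargazers_count", 0) < min_stars:
            continue
        if tag and tag not in rec.get("tags", []):
            continue
        yield rec
```

For example, `filter_artifacts(records, language="Python", min_stars=100)` would select popular Python implementations for a targeted study.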
5. Research Applications and Implications
The MCPCorpus dataset and associated infrastructure support a range of advanced ecosystem analyses:
- Ecosystem Adoption Analysis: By examining attributes such as creation timestamps, star/commit time series, and contributor metrics, researchers can reconstruct the evolutionary trajectory of MCP adoption and detect emergent trends.
- Security Audit and Code Quality: Detailed interface and configuration metadata (e.g., presence of Dockerfiles, frequency of updates, README completeness) provide empirical signals for automated analysis of vulnerability exposure and secure-by-design patterns. MCPCorpus is positioned as a benchmark for security posture studies within the MCP landscape.
- Implementation Diversity and Interoperability: The detailed annotation of programming languages, dependencies, and interface configuration enables studies on cross-compatibility and technical uniformity/divergence within the MCP ecosystem.
- Longitudinal Studies: Regularly updated, normalized metadata enable longitudinal tracking of project lifecycles, repository health, license variations, and organizational adoption flows.
- Empirical Benchmarking: MCPCorpus supplies a standardized ground truth for data-driven benchmarking across tool-augmented LLM agent infrastructures.
A plausible implication is that the structured and extensible nature of the dataset facilitates reproducible measurement of ecosystem health and implementation diversity, enabling quantified assessment of MCP-based integrations over time.
6. Dataset Structure and Mathematical Considerations
While the MCPCorpus paper does not explicitly provide new mathematical models or formulas for dataset analysis, it employs LaTeX only for schema tables, attribute tabulation, and figure rendering. Quantitative analysis, such as statistical correlation of star counts, contributor activity, or interface uniformity, can be conducted externally using the provided attributes:
- No custom mathematical metric or formalism is presented in the underlying paper; all mathematical modeling is left to downstream users of the dataset.
- The artifact annotation schema is designed to support such modeling by providing normalized, consistent variable types, including integer signals (star/fork count), timestamp records, and enumerated types (categories, tags).
7. Significance for the MCP Tooling Research Community
MCPCorpus provides a reproducible, evolving, and richly annotated foundation for empirical research into the MCP ecosystem, addressing the challenge of fragmentation in the tool-augmented LLM landscape. By standardizing data representation and providing scalable infrastructure for synchronization, normalization, and search, it underpins comparative, longitudinal, and code-level analyses.
Researchers using MCPCorpus gain a robust reference set to measure adoption trends, study implementation variation, and benchmark agent interoperability, while practitioners are empowered to perform practical vulnerability assessments and integration feasibility studies across the expanding universe of MCP artifacts.
MCPCorpus is publicly available at https://github.com/Snakinya/MCPCorpus, with ongoing maintenance infrastructure to accommodate and mirror the rapid evolution of the protocol and its applications in production environments (Lin et al., 30 Jun 2025).