
MLCPD: Universal Code Parser Dataset

Updated 25 October 2025
  • MultiLang Code Parser Dataset (MLCPD) is a unifying corpus that provides hierarchical AST representations and semantic mappings for ten major programming languages.
  • It integrates over 7 million parsed source files with rich metadata and a universal schema, enabling scalable analysis, cross-language reasoning, and effective model training.
  • Empirical analyses reveal high parsing accuracy and significant syntactic regularity, validating its application for multilingual program analysis and research.

The MultiLang Code Parser Dataset (MLCPD) is a large-scale, language-agnostic corpus designed to unify syntactic and structural representations of source code across ten principal programming languages. Unlike prior datasets focused on token-level content or isolated parser outputs, MLCPD provides hierarchical Abstract Syntax Tree (AST) representations and semantic abstractions, encapsulated within a universal schema to enable consistent cross-language reasoning, model training, and analysis tasks for multilingual software.

1. Dataset Composition

MLCPD contains over seven million parsed source files sourced from permissively licensed repositories, notably the StarCoder dataset. The coverage spans ten major languages—C, C++, C#, Go, Java, JavaScript, Python, Ruby, Scala, and TypeScript. Each file is parsed and stored with:

  • Complete hierarchical ASTs structured in JSON that preserve all syntactic elements.
  • Rich metadata, including line counts, average line lengths, AST node counts, error metrics, and SHA-256-based source fingerprints.
  • Multiple layers of abstraction, capturing atomic syntax as well as semantic aggregations (e.g., mapping specific syntax to language-agnostic function/type categories).
  • Serialization in Parquet format to ensure scalable and efficient data retrieval suitable for large-model pretraining and graph-based analysis.

MLCPD guarantees a lossless representation: no syntactic content is omitted or collapsed, and every parsed node remains available for downstream structural tasks.
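As a sketch, a single file entry combining the metadata described above with a flat AST payload might look like the following. The field names here (`lines`, `avg_line_length`, `nodes`, `errors`, `source_hash`) are taken from the schema description in this article, but the exact column layout is an assumption, not MLCPD's verbatim Parquet schema:

```python
import hashlib
import json

source = "def check_age(age):\n    return age >= 18\n"
lines = source.splitlines()

# Illustrative record; mirrors the metadata fields described above,
# with placeholder values for the AST payload.
record = {
    "language": "python",
    "lines": len(lines),
    "avg_line_length": sum(len(l) for l in lines) / len(lines),
    "nodes": 10,          # placeholder AST node count
    "errors": 0,
    "source_hash": hashlib.sha256(source.encode("utf-8")).hexdigest(),
    "ast": [],            # flat node array would go here (see Section 2)
}

print(json.dumps(record, indent=2))
```

The SHA-256 fingerprint supports the deduplication use case mentioned above: two files with identical content hash to the same `source_hash` regardless of filename or repository.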

2. Universal AST Schema

Central to MLCPD is a universal AST schema structured into four hierarchical layers:

  1. Metadata Block: Describes global file properties (e.g., lines, avg_line_length, nodes, errors, source_hash). This block gives immediate access to file granularity, parse health, and identity for deduplication.
  2. Flat Node Array: Linearizes the AST so that each node carries an identifier, type string, associated code snippet, parent-child relationship indices, and file-level positions. This enables O(1) node and parent lookup and high-throughput linear traversal.
  3. Node Categorization: Aggregates nodes into a universal taxonomy—declarations (e.g., classes, functions), statements (e.g., control flow, returns), and expressions. For example, a mapping might link the AST nodes for Python def or Java public static void into a shared "function declaration" role.
  4. Cross-Language Map: Normalizes heterogeneous language constructs under shared semantic concepts. This layer’s JSON mappings enable, for instance, research queries or program analyses that treat structurally-similar entities equivalently regardless of surface syntax.

Visualizations in the paper demonstrate that programs as varied as Python “def” blocks and Java methods parse into aligned representations when traversed under the schema. Algorithmic details (Algorithm 1: Extract AST Structure, Algorithm 2: Create Cross-Language Map) are provided for schema construction.
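The flat-node-array idea can be illustrated with Python's own `ast` module, used here purely as a stand-in (MLCPD itself parses with Tree-sitter grammars): every node receives an index, a type string, and a parent index, so parent lookup is a single array access.

```python
import ast

def flatten(tree):
    """Linearize an AST into a flat node array with parent indices."""
    nodes = []

    def visit(node, parent):
        idx = len(nodes)
        nodes.append({"id": idx, "type": type(node).__name__, "parent": parent})
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree, -1)  # root has no parent, marked with -1
    return nodes

flat = flatten(ast.parse("def check_age(age):\n    return age >= 18\n"))
print(flat[1])  # {'id': 1, 'type': 'FunctionDef', 'parent': 0}
```

Because children always appear after their parents in the array, the invariant `node["parent"] < node["id"]` holds for every non-root node, which is what makes single-pass, high-throughput analyses straightforward.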

3. Cross-Language Reasoning Capabilities

By enforcing a uniform schema, MLCPD facilitates comparative and transfer analyses across languages:

  • Functionally-equivalent constructs (functions, conditionals, classes) are cross-mapped so structural queries or graph-based representation learning can operate without language-specific branching.
  • The Cross-Language Map layer guarantees that semantic roles (e.g., "function", "class") remain consistent across the ten languages, supporting tasks such as multilingual program translation, cross-language vulnerability scanning, and universal static analysis.

Empirical case studies (e.g., comparing "Age Check" programs in Python and Java, detailed in Figure 1 of the paper) show high cross-language AST alignment and semantic regularity.
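In spirit, the Cross-Language Map behaves like a lookup table from language-specific node types to shared semantic roles. The concrete node-type names below are illustrative Tree-sitter-style names, and the mapping itself is a hypothetical fragment, not MLCPD's actual table:

```python
# Hypothetical fragment of a cross-language map: language-specific
# node types normalized to shared semantic roles.
CROSS_LANGUAGE_MAP = {
    ("python", "function_definition"): "function_declaration",
    ("java", "method_declaration"): "function_declaration",
    ("python", "if_statement"): "conditional",
    ("java", "if_statement"): "conditional",
    ("python", "class_definition"): "class_declaration",
    ("java", "class_declaration"): "class_declaration",
}

def semantic_role(language, node_type):
    """Resolve a language-specific node type to its universal role."""
    return CROSS_LANGUAGE_MAP.get((language, node_type), "other")

# A Python def and a Java method resolve to the same semantic role:
print(semantic_role("python", "function_definition"))  # function_declaration
print(semantic_role("java", "method_declaration"))     # function_declaration
```

A downstream analysis can then branch on the shared role rather than on ten per-language node vocabularies.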

4. Empirical Structural Analysis

The paper presents comprehensive empirical characterizations:

  • Parsing Success Rate: MLCPD achieves a 99.99994% success rate (7,021,718 out of 7,021,722 files).
  • Statistical and Visual Analyses: Pie charts and bar plots illustrate uniform language representation; node density plots compare syntactic compactness (e.g., Go and C++ have higher node counts per file than Python).
  • Cross-language Similarity: The dataset includes a cosine similarity matrix of node-type distributions, revealing high syntactic regularities (e.g., C and C++ at 0.90+, JavaScript and TypeScript at 0.96).
  • Principal Component Analysis (PCA): Scatter plots from PCA show natural clustering of structurally similar languages—such as Java with Scala and JavaScript with TypeScript—validating the schema’s ability to capture shared structures.

These analyses confirm that code drawn from disparate languages can reliably be embedded and compared under MLCPD’s schema.
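The cosine-similarity comparison of node-type distributions can be reproduced in a few lines. The node-type counts below are made up for illustration; only the method, not the numbers, reflects the paper's analysis:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two node-type count distributions."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Toy node-type counts for two languages (illustrative values only).
js = {"function_declaration": 120, "call_expression": 400, "if_statement": 90}
ts = {"function_declaration": 110, "call_expression": 380, "if_statement": 95,
      "type_annotation": 60}

print(round(cosine_similarity(js, ts), 3))
```

Syntactically close languages such as JavaScript and TypeScript share most node types with similar relative frequencies, which is why their reported similarity (0.96) sits near the top of the matrix.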

5. Tools and Ecosystem

MLCPD is distributed with an open-source toolkit enabling:

  • Dataset Reproduction Pipelines: Scripts to process, normalize, and parse raw code using Tree-sitter grammars, culminating in universal schema generation.
  • Grammar Compilation: Pre-compiled Tree-sitter grammar libraries for deterministic, scalable parsing across all ten languages.
  • Visualization Utilities: Interactive exploration of unified ASTs, allowing direct comparison of language constructs and semantic roles.
  • Schema Validation and Parquet Serialization: Ensures consistency and efficient access for large-scale analyses.

Pipelines are provided for every preprocessing, parsing, and normalization step, with validation tools enforcing strict schema conformance.
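A minimal validation pass in the same spirit might look as follows. The specific checks (required metadata fields, valid parent indices) are assumptions about what schema conformance entails, not MLCPD's actual validator:

```python
# Assumed required metadata fields, based on the schema described above.
REQUIRED_METADATA = {"lines", "avg_line_length", "nodes", "errors", "source_hash"}

def validate_record(record):
    """Sketch of a schema-conformance check for one file entry."""
    problems = []
    missing = REQUIRED_METADATA - record.get("metadata", {}).keys()
    if missing:
        problems.append(f"missing metadata fields: {sorted(missing)}")
    for i, node in enumerate(record.get("ast", [])):
        parent = node.get("parent", -1)
        # Parents must precede children; only the root may have parent -1.
        if parent >= i or (parent < 0 and i != 0):
            problems.append(f"node {i} has invalid parent index {parent}")
    return problems

record = {
    "metadata": {"lines": 2, "avg_line_length": 19.5, "nodes": 3,
                 "errors": 0, "source_hash": "0" * 64},
    "ast": [{"id": 0, "type": "module", "parent": -1},
            {"id": 1, "type": "function_declaration", "parent": 0},
            {"id": 2, "type": "return_statement", "parent": 1}],
}
print(validate_record(record))  # [] means the record conforms
```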

6. Availability and Open Access

MLCPD is publicly released to promote reproducibility and collaborative research. The dataset and its accompanying toolkit are complemented by extensive documentation and schema definitions, ensuring immediate usability by academic and industrial research groups.

7. Significance and Future Directions

MLCPD sets a precedent for open, unified, language-agnostic program analysis datasets, serving as a reproducible foundation for advancing research in cross-language representation learning, graph-based modeling, and multilingual static analysis. Its universal schema and high level of abstraction enable robust structural comparison, diverse program mining, and unified model training. The dataset’s empirical findings reveal that core syntactic graphs are highly alignable across major programming languages—a critical insight for future work on multilingual neural program representations and automated reasoning over heterogeneous codebases.
