
Multimodal Coding Dataset (MCD)

Updated 2 July 2026
  • Multimodal Coding Dataset (MCD) is a large-scale dataset with 598,000 examples integrating visual and textual inputs for code generation and comprehension.
  • The dataset employs multi-stage quality control and GPT-4o rewriting to ensure clarity, correctness, and functional outputs across diverse domains.
  • MCD drives the development and benchmarking of unified vision-language-code models, advancing research in multimodal instruction tuning.

The Multimodal Coding Dataset (MCD) is one of the first large-scale resources for supervised instruction tuning of models that process and understand programming information in multimodal contexts, combining visual and textual inputs. Introduced with the VisCodex framework, MCD consists of 598,000 curated examples spanning HTML code generation, chart code synthesis, image-augmented code question answering, and algorithmic problem-solving. The corpus is designed to supply diverse, visually grounded, high-quality coding instances for training and benchmarking unified vision-language-code models (Jiang et al., 13 Aug 2025).

1. Composition and Scope

MCD contains 598,000 instruction-tuning examples, each built to support code generation or comprehension conditioned on multimodal cues. The data is divided into four principal domains:

  • Enhanced HTML Code: 200,000 samples derived from 560,000 real web page screenshots (“style seeds”) with synthetic HTML+CSS generated via GPT-4o, rendered and filtered through Playwright and multi-stage QC to remove low-quality screenshots and artifacts.
  • Chart Image–Code Pairs: 210,000 samples, of which 164,000 are synthetic (ChartCoder: a matplotlib script plus its rendered chart) and 46,000 are real Python matplotlib scripts from GitHub. Each script is rewritten, normalized, and quality-filtered with GPT-4o for execution correctness and chart rendering quality.
  • Image-Augmented StackOverflow QA: 59,000 instances mined from StackOverflow threads that contain at least one image and an accepted Python or HTML answer. The data is processed to eliminate broken images, overly short or overly long answers, and sensitive content, and the answers are rewritten by GPT-4o for clarity.
  • Algorithmic Code: 129,000 entries aggregated from KodCode’s synthetic challenges, incorporating problems akin to LeetCode, Codeforces, and classic algorithmic repositories.

Together, the four domains give the corpus broad coverage of both visual and programmatic contexts.
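The per-domain counts above can be checked against the stated total with a few lines of arithmetic. The sketch below only restates numbers from the list; the dictionary keys are illustrative labels, not identifiers from the dataset.

```python
# Per-domain sample counts as stated in the composition list.
DOMAIN_COUNTS = {
    "html": 200_000,
    "chart": 210_000,
    "qa": 59_000,
    "algorithm": 129_000,
}

# The chart domain is split into synthetic and GitHub-derived scripts.
CHART_PARTS = {"synthetic": 164_000, "github": 46_000}

total = sum(DOMAIN_COUNTS.values())
shares = {name: count / total for name, count in DOMAIN_COUNTS.items()}

# The two chart sources should add up to the chart domain size.
chart_parts_consistent = sum(CHART_PARTS.values()) == DOMAIN_COUNTS["chart"]

for name, share in shares.items():
    print(f"{name:<10} {DOMAIN_COUNTS[name]:>8,} {share:6.1%}")
print(f"{'total':<10} {total:>8,}")
```

Chart and HTML data together make up roughly two thirds of the corpus, so the image-conditioned generation tasks dominate the mixture.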

2. Data Format, Annotations, and Quality Control

Each sample in MCD is represented as a JSON object with the following schema:

| Field | Description |
|---|---|
| id | Unique string or integer identifier |
| image | File path or base64-encoded PNG/JPEG (null for algorithmic code) |
| instruction | NL prompt directing the model task (code generation, explanation, etc.) |
| input | Optional text/code context (usually empty outside QA/algorithmic samples) |
| output | Target code snippet in plain text format |
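A record with this schema can be checked with a small validator. The sketch below is illustrative only: the field names come from the schema, but the function name `validate_sample` and the exact type rules (for example, treating a missing `image` the same as null) are assumptions, not part of any released MCD tooling.

```python
from typing import Any


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one MCD-style record (empty if none)."""
    problems = []
    if not isinstance(sample.get("id"), (str, int)):
        problems.append("id must be a string or integer")
    # image may be a path, a base64 string, or null/absent (algorithmic code).
    image = sample.get("image")
    if image is not None and not isinstance(image, str):
        problems.append("image must be a string or null")
    for field in ("instruction", "output"):
        value = sample.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} must be a non-empty string")
    # input is optional context and may be empty.
    if not isinstance(sample.get("input", ""), str):
        problems.append("input must be a string when present")
    return problems


example = {
    "id": "html_000123",
    "image": "/images/html_000123.png",
    "instruction": "Generate HTML+CSS to match the above screenshot.",
    "input": "",
    "output": "<!DOCTYPE html>",
}
print(validate_sample(example))  # an empty list means no problems were found
```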

Images are stored as PNG or JPEG. Only resolutions between 400×300 px and 1,280×720 px are kept; images outside this range are dropped.
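The resolution filter can be sketched as a simple bounds check. The source gives only the two corner resolutions, so the rule used here (width and height must each fall within the stated bounds) is an assumption, and `resolution_ok` is a hypothetical helper name.

```python
MIN_SIZE = (400, 300)    # (width, height) lower bound stated in the text
MAX_SIZE = (1280, 720)   # (width, height) upper bound stated in the text


def resolution_ok(width: int, height: int) -> bool:
    """Assumed rule: both dimensions must lie within the stated bounds."""
    return (MIN_SIZE[0] <= width <= MAX_SIZE[0]
            and MIN_SIZE[1] <= height <= MAX_SIZE[1])


# Hypothetical (width, height) pairs, used only to exercise the check.
candidates = [(800, 600), (320, 240), (1920, 1080), (1280, 720)]
kept = [size for size in candidates if resolution_ok(*size)]
print(kept)
```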

Annotation is highly automated and layered:

  1. All code/natural language rewritten for normalization and clarity by GPT-4o.
  2. Automated rejectors enforce compilability/executability for code and correct rendering for images.
  3. Multi-stage, rule-based filters exclude malformed or low-aesthetic/functional outputs.
  4. Final GPT-4o-based scoring retains only the highest-quality solutions and visuals.
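As a rough illustration of the automated rejection step, the sketch below keeps only Python snippets that parse. It is a deliberately weak stand-in: the source describes checks for compilability, executability, and correct rendering, whereas this example tests syntax only, and the helper name `python_parses` is hypothetical.

```python
import ast


def python_parses(source: str) -> bool:
    """Return True if the snippet is syntactically valid Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


# Hypothetical candidate outputs, used only to exercise the filter.
candidates = [
    "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\nplt.show()",
    "def broken(:\n    pass",
]
accepted = [code for code in candidates if python_parses(code)]
print(len(accepted), "of", len(candidates), "candidates kept")
```

A fuller pipeline would run each script in a sandbox and inspect the rendered output; the syntax check is only a cheap first gate before those steps.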

3. Statistics, Diversity, and Splits

MCD is used exclusively for supervised fine-tuning in the VisCodex pipeline, with no official train/val/test split. All 598,000 samples are shuffled for instruction tuning; evaluation always occurs on external held-out benchmarks.

Key per-domain statistics:

| Domain | # Samples | Avg. Output Length (tokens, mean ± std) |
|---|---|---|
| HTML | 200,000 | 632 ± 144 |
| Chart | 210,000 | 551 ± 190 |
| QA | 59,000 | 1,022 ± 776 |
| Algorithm | 129,000 | 969 ± 321 |
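The per-domain means can be combined into a corpus-level estimate by weighting each mean by its sample count. The figures below are copied from the table; because the means are rounded, the result is an approximation and not a number reported in the source.

```python
# (sample count, mean output length in tokens) per domain, from the table.
STATS = {
    "HTML": (200_000, 632),
    "Chart": (210_000, 551),
    "QA": (59_000, 1_022),
    "Algorithm": (129_000, 969),
}

total_samples = sum(n for n, _ in STATS.values())
total_tokens = sum(n * mean for n, mean in STATS.values())
weighted_mean = total_tokens / total_samples

print(f"samples: {total_samples:,}")
print(f"approx. output tokens: {total_tokens:,}")
print(f"weighted mean length: {weighted_mean:.1f} tokens")
```

Under these figures the target side of the corpus comes to roughly 427 million tokens, with an overall mean of about 715 tokens per output.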

Vocabulary size is not reported explicitly. Post-filtering, image resolution distribution is only summarized (no histograms). No formal diversity/coverage metrics are defined; code-token entropy could be computed as $D = (1/N)\sum_{i=1}^N H(c_i)$ for $N$ samples with tokenized code $c_i$, with $H$ denoting Shannon entropy.
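The entropy measure suggested above can be sketched as follows. Two assumptions are made here that the source does not specify: $H(c_i)$ is taken as the Shannon entropy (in bits) of the empirical token distribution within sample $c_i$, and tokens are obtained by whitespace splitting in place of a real code tokenizer.

```python
import math
from collections import Counter


def shannon_entropy(tokens: list[str]) -> float:
    """Entropy in bits of the empirical token distribution of one sample."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_code_entropy(samples: list[list[str]]) -> float:
    """D = (1/N) * sum_i H(c_i) over N tokenized code samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(shannon_entropy(tokens) for tokens in samples) / len(samples)


# Whitespace tokenization is an assumption made for illustration.
codes = ["x = x + 1", "plt.plot(x) plt.show()"]
print(mean_code_entropy([code.split() for code in codes]))
```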

4. Downstream Benchmarks and Evaluation Protocols

Models trained on MCD are not evaluated on MCD itself; they are benchmarked on four external datasets aligned with MCD's domains:

  • Design2Code (UI screenshot → HTML):
    • Low-level metrics: block-matching ratio, F1 for text, IoU of element boxes, pixel color agreement.
    • High-level: GPT-4 or human semantic judgment.
  • ChartMimic (Chart image → code):
    • Low-level: pixel-level render comparisons, SSIM.
    • High-level: GPT-4-rated semantic match.
  • MMCode (Visually-rich algorithmic QA):
    • Metric: pass@k accuracy, i.e., probability of producing at least one correct answer in $k$ attempts.
  • InfiBench-V (Image-augmented code QA):
    • Keyword matching score, unit testing (fraction of test cases passed), and GPT-Judge (GPT-4o comparative correctness).

These metrics are not used for MCD examples themselves but for models trained on MCD (Jiang et al., 13 Aug 2025).
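For reference, pass@k is commonly estimated from n generated samples per problem, of which c are correct, as 1 − C(n−c, k)/C(n, k). The sketch below implements that widely used estimator; whether MMCode computes pass@k in exactly this way is not stated in the source, so treat it as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k drawn samples is correct.

    n: samples generated for the problem, c: correct samples, k: draw size.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Averaging this value over all benchmark problems gives the reported pass@k score under this estimator.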

5. Representative Examples

Selected samples illustrate the diversity of MCD tasks and data modalities:

  • HTML sample:

{
  "id": "html_000123",
  "image": "/images/html_000123.png",
  "instruction": "Generate HTML+CSS to match the above screenshot.",
  "input": "",
  "output": "<!DOCTYPE html>\n<html lang=\"en\"> ... </html>"
}

  • Chart sample:

{
  "id": "chart_045678",
  "image": "/charts/045678.png",
  "instruction": "Write Python matplotlib code to reproduce this chart.",
  "input": "",
  "output": "import matplotlib.pyplot as plt\nx = [1,2,3,4] ... plt.show()"
}

  • QA sample:

{
  "id": "qa_00089",
  "image": "/so_images/89.png",
  "instruction": "Why is my fixed-position footer not appearing at bottom of page?",
  "input": "<!doctype html>… CSS snippet …",
  "output": "Your body is set to height:100%; ..."
}

  • Algorithm sample:

{
  "id": "alg_0310",
  "instruction": "Given a binary string b, count the minimum flips so that it contains no substring \"010\".",
  "input": "b = \"0101010\"",
  "output": "def beautifulBinaryString(b): ..."
}
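Records like these would typically be flattened into a prompt and a target before instruction tuning. The sketch below shows one possible way to do that; the `<image>` placeholder and the `to_training_pair` helper are hypothetical, and the actual prompt template used by VisCodex is not described in this section.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical token marking where the image goes


def to_training_pair(sample: dict) -> tuple[str, str]:
    """Build a (prompt, target) pair from one MCD-style record."""
    parts = []
    if sample.get("image"):
        parts.append(IMAGE_PLACEHOLDER)
    parts.append(sample["instruction"])
    if sample.get("input"):
        parts.append(sample["input"])
    return "\n".join(parts), sample["output"]


# The algorithm sample from above; the elided output is kept as-is.
alg_sample = {
    "id": "alg_0310",
    "instruction": 'Given a binary string b, count the minimum flips so that it contains no substring "010".',
    "input": 'b = "0101010"',
    "output": "def beautifulBinaryString(b): ...",
}

prompt, target = to_training_pair(alg_sample)
print(prompt)
```

Samples without an image, such as the algorithm example, yield a text-only prompt, which matches the schema's note that `image` is null for algorithmic code.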

6. Significance and Research Applications

MCD is foundational for developing and evaluating models requiring nuanced understanding of both visual and code/textual cues. The data corpus is sufficiently broad to encompass:

  • Instruction-following code generation tasks grounded in screenshots or visualizations.
  • Chart interpretation and programmatic chart synthesis from images.
  • Visual-contextual code QA, as in StackOverflow troubleshooting with visual artifacts.
  • Pure algorithmic reasoning with or without visual modality.

It explicitly enables training of unified MLLMs that merge vision and code expertise, as in VisCodex, and supports comprehensive evaluation against benchmarks demanding visual-textual-code comprehension (Jiang et al., 13 Aug 2025).

7. Access and Future Directions

As of 2025, MCD is accessible in conjunction with the VisCodex project, subject to the usage policies outlined in that work. No vocabulary file or explicit dev/test splits are provided; extensions or alternative splits are left as open research directions. A plausible implication is that future work will define formal diversity metrics or expand the data to domains not yet represented, responding to ongoing community advances in multimodal code understanding (Jiang et al., 13 Aug 2025).
