
JanusCode-800K Multimodal Code Corpus

Updated 3 November 2025
  • JanusCode-800K is a comprehensive multimodal dataset merging instructional text, executable code, and rendered visuals for unified model training.
  • It uses a hierarchical synthesis and quality control pipeline to ensure high code execution fidelity, visual clarity, and accurate instruction alignment.
  • The corpus underpins models like JanusCoder and JanusCoderV, demonstrating improved performance in code structure, visual fidelity, and cross-domain generalization.

JanusCode-800K is a large-scale, multimodal code corpus designed to enable the training and benchmarking of unified visual-programmatic models. Comprising over 800,000 samples, JanusCode-800K addresses acute bottlenecks in scalable data synthesis, quality assessment, and cross-domain coverage for tasks at the intersection of code intelligence and visual reasoning. It underpins the development of JanusCoder and JanusCoderV, models that extend code generation and understanding to both text-centric and vision-centric inputs, supporting comprehensive evaluation across instructional, visual, and programmatic axes.

1. Corpus Structure and Coverage

JanusCode-800K constitutes a broad collection of multimodal samples that systematically merge instructional text, executable source code, and rendered visual output. The construction spans a diverse programmatic landscape, including classic charts and visualizations, interactive web UIs, code-driven animations, scientific demonstrations, and algorithmic artifacts. Distinct modalities are organized as follows:

| Data Type | # Samples |
| --- | --- |
| Python Visualization: Generation | 127.5K |
| Python Visualization: Editing | 51.8K |
| SVG | 20.0K |
| Animation | 19.5K |
| General Artifacts | 56.8K |
| Algorithm Data | 100.0K |
| Scientific PLs | 31.8K |
| Chart-to-Code | 70.0K |
| WebUI Generation | 200.0K |
| WebUI Editing | 69.5K |
| Scientific Demonstration | 53.0K |

Both text-centric tasks (e.g., instruction-to-code) and vision-centric tasks (e.g., chart-to-code and webpage editing) are robustly represented. This distribution promotes balanced training for unified code intelligence models that must reason across modalities.
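As a quick sanity check, the per-category counts above can be tallied to confirm they total roughly 800K samples (a minimal sketch; the category names mirror the table):

```python
# Per-category sample counts from the corpus table, in thousands.
counts_k = {
    "Python Visualization: Generation": 127.5,
    "Python Visualization: Editing": 51.8,
    "SVG": 20.0,
    "Animation": 19.5,
    "General Artifacts": 56.8,
    "Algorithm Data": 100.0,
    "Scientific PLs": 31.8,
    "Chart-to-Code": 70.0,
    "WebUI Generation": 200.0,
    "WebUI Editing": 69.5,
    "Scientific Demonstration": 53.0,
}

total_k = sum(counts_k.values())
print(f"Total: {total_k:.1f}K samples")  # just under 800K
```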

2. Modalities: Data Types, Tasks, and Programming Languages

JanusCode-800K provides comprehensive support for a range of input and output modalities across three axes:

Visual Outputs include static charts (e.g., Matplotlib, Seaborn), SVG vector graphics, Manim/3Blue1Brown-style animations, interactive web applications (HTML, CSS, JavaScript), scientific demonstrations (Wolfram Mathematica, R, Matlab), and general artifacts such as games and management systems.

Code Modalities span Python (emphasizing data and visualization), scientific languages (Matlab, Wolfram Language, R), web technologies (HTML, CSS, JavaScript), SVG, and domain-specific animation scripts (Manim).

Supported Tasks encompass instruction-to-code, chart-to-code, webpage editing/generation (including image/screen-to-code), algorithmic problem solving, scientific code translation, and bidirectional translation between domains and languages.

This breadth ensures JanusCode-800K supports not only standard code-generation and reasoning tasks but also complex, visually entangled tasks required for effective visual-programmatic interfaces.

3. Synthesis Toolkit and Data Curation Pipeline

The JanusCode-800K corpus is synthesized using a hierarchical toolkit designed to maximize data quality and coverage. The synthesis pipeline incorporates:

  1. Data Collection: Aggregation from large-scale public datasets (e.g., StackV2, WebCode2M), web crawls, and specialized sources such as Wolfram Demonstrations.
  2. Data Curation: Multifaceted strategies transform and extend raw samples:
    • Guided Evolution promotes diversity via meta-task-driven mutation of instruction/code pairs.
    • Re-contextualization improves instruction-code alignment and increases semantic richness.
    • Reverse Instruction generates natural language instructions for code-only assets.
    • Bidirectional Translation amplifies cross-domain and cross-lingual coverage.
  3. Quality Control: Samples undergo:
    • Automated code execution within sandboxed environments corresponding to their language and runtime.
    • LLM or Vision LLM (VLM)-based multi-objective reward modeling and scoring, with each dimension (task relevance, completeness, code quality, visual clarity) scored 1–5.
    • Only samples achieving a score S = R(I, C, V) above a fixed threshold (where I is the instruction, C the code, and V the visual output) are retained.
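The score-and-filter step can be sketched as follows. Here `score_with_reward_model` stands in for the LLM/VLM judge (a hypothetical function; the corpus description does not specify the actual prompts or judge models), and each dimension is scored 1–5:

```python
from dataclasses import dataclass

DIMENSIONS = ("task_relevance", "completeness", "code_quality", "visual_clarity")

@dataclass
class Sample:
    instruction: str  # I
    code: str         # C
    visual: bytes     # V: rendered output (e.g., PNG bytes)

def score_with_reward_model(sample: Sample) -> dict:
    """Placeholder for the LLM/VLM judge: returns a 1-5 score per dimension."""
    # In the real pipeline this is a model call; here we return a fixed stub.
    return {dim: 5 for dim in DIMENSIONS}

def passes_quality_gate(sample: Sample, threshold: float = 4.0) -> bool:
    """Retain the sample only if the aggregate reward S = R(I, C, V) clears the threshold."""
    scores = score_with_reward_model(sample)
    aggregate = sum(scores.values()) / len(scores)
    return aggregate > threshold
```

The threshold value of 4.0 is illustrative; the corpus description states only that sub-threshold samples are discarded.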

Webpage structure alignment is specifically evaluated using TreeBLEU:

TreeBLEU = |S(t) ∩ S(t̂)| / |S(t̂)|

where S(·) denotes the set of 1-height subtrees and t, t̂ are the candidate and reference parse trees.
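TreeBLEU can be computed over any tree whose nodes expose a label and children; a 1-height subtree is a node together with its immediate children. A minimal sketch (the tuple-based tree encoding is an assumption for illustration, not the paper's representation):

```python
def subtrees_1h(tree):
    """Collect 1-height subtrees as (label, (child labels...)) for every internal node."""
    label, children = tree[0], tree[1:]
    result = set()
    if children:
        result.add((label, tuple(child[0] for child in children)))
        for child in children:
            result |= subtrees_1h(child)
    return result

def tree_bleu(candidate, reference):
    """TreeBLEU = |S(t) ∩ S(t_ref)| / |S(t_ref)|."""
    ref = subtrees_1h(reference)
    if not ref:
        return 0.0
    return len(subtrees_1h(candidate) & ref) / len(ref)

# DOM-like parse trees encoded as (tag, *children).
ref = ("html", ("body", ("div", ("p",), ("p",)), ("footer",)))
cand = ("html", ("body", ("div", ("p",), ("p",))))
print(tree_bleu(cand, ref))  # 2 of 3 reference subtrees matched
```

The candidate above misses only the `("body", ("div", "footer"))` subtree, since its `body` node lacks the `footer` child.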

Editor’s term: Multi-strategy curation refers to the use of overlapping synthesis, translation, and validation steps for increased coverage and robustness.

4. Data Generation, Cross-Modal Synergy, and Quality Assessment

Synthesis strategies are chosen per domain and combined to maximize sample diversity and task coverage. Strategies are as follows:

| Domain | Synthesis Strategies |
| --- | --- |
| Visualization | Guided Evolution, Reverse Instruction, Re-contextualization |
| Animation/Artifacts | Guided Evolution, Bidirectional Translation |
| WebUI | Guided Evolution, Re-contextualization, Reverse Instruction |
| Scientific Demonstrations | Bidirectional Translation, Guided Evolution |

Validation encompasses:

  • Sandbox execution of every code sample in contextually matched environments (e.g., Python for visualization, Manim for animation, Mathematica for demonstrations, Playwright for web rendering).
  • VLM-based chain-of-thought reward modeling to assess alignment between instruction, code, and rendered visual.
  • Filtering via the multidimensional score detailed above.
  • Automated and, for some tasks (e.g., DTVBench), human annotation applying explicit rubrics for subjective benchmarks.
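For Python samples, the sandbox-execution check can be sketched with a subprocess and a timeout; real pipelines add resource limits and per-language runtimes (Manim, Mathematica, Playwright), which are omitted here:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def executes_cleanly(code: str, timeout_s: float = 10.0) -> bool:
    """Run a Python sample in a separate interpreter process and report success.

    A stand-in for the corpus's sandboxed execution check: a sample passes
    only if it exits with status 0 within the time budget.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, timeout=timeout_s, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

print(executes_cleanly("print(sum(range(10)))"))   # well-formed sample
print(executes_cleanly("import nonexistent_mod"))  # fails execution
```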

Cross-domain synergy leverages translation to expand and bridge coverage—e.g., converting scientific code (R, Matlab) to Manim or Wolfram Language animations, and cross-pollinating UI and SVG data.

5. Model Training, Evaluation Protocols, and Benchmarks

JanusCode-800K powers the training of JanusCoder (text-centric) and JanusCoderV (fully multimodal) models. Backbone models (Qwen3, Qwen2.5-VL, InternVL3.5) are adapted to handle the corpus’s diverse input formats. The suite of evaluative tasks spans:

  • Visualization Generation (PandasPlotBench, ArtifactsBench)
  • Chart-to-Code (ChartMimic)
  • Webpage Generation/Editing (WebCode2M, DesignBench)
  • Algorithmic Reasoning
  • Dynamic Theorem Visualization (DTVBench)
  • SVG/Vector Graphics Generation
  • Scientific Demonstrations (InteractScience)
  • General Code Problem Solving (BigCodeBench, LiveCodeBench)

Performance is assessed using granular metrics:

  • Text-centric: error rate (percentage of incorrect code) plus mean task and visual scores on a 1–100 scale.
  • Multimodal: Executability, visual fidelity (CLIP similarity), code structure alignment (TreeBLEU), programmatic correctness (pass rates), VLM/human scoring.
  • Animation/Theorem Visualization: a Total Score combining executability, code similarity, instruction alignment, and human-judged faithfulness: Total Score = s_exec · (s_sim + s_align + s_faith)
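The Total Score gates everything on executability: a sample that fails to run scores zero regardless of how similar, aligned, or faithful its code is. A minimal sketch (the individual score ranges are assumptions; the benchmark defines the exact scales):

```python
def total_score(s_exec: float, s_sim: float, s_align: float, s_faith: float) -> float:
    """Total Score = s_exec * (s_sim + s_align + s_faith).

    s_exec is typically binary (0 or 1), so non-executable code zeroes the score.
    """
    return s_exec * (s_sim + s_align + s_faith)

print(total_score(1, 0.8, 0.9, 0.7))  # executable: quality terms sum
print(total_score(0, 0.8, 0.9, 0.7))  # not executable: zeroed out
```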

6. Empirical Findings and Implications

Key results reveal that JanusCoder and JanusCoderV, trained on JanusCode-800K, surpass commercial baselines (e.g., GPT-4o, Claude, Gemini) in code structure correctness (TreeBLEU), visual fidelity, and instruction alignment. A plausible implication is that data synergy across programmatic and visual domains enhances cross-domain generalization; for instance, scientific PL data strengthens animation performance.

Reward modeling is identified as essential for quality assurance: simple execution checks are insufficient for verifying visual correctness. Models trained on JanusCode-800K exhibit broader, more balanced capabilities relative to both specialized (VisCoder) and general models (GPT-4o), a finding that holds across model backbones and parameter counts. This positions the corpus as a backbone-independent resource for future multimodal code intelligence research.

7. Broader Significance and Applications

JanusCode-800K constitutes a foundational dataset for unified code+vision AI, addressing the scarcity of large, high-quality multimodal code corpora. Its synthesis toolkit, systematic quality validation, and modality integration mark a distinctive advance over prior specialized or text-centric datasets. The corpus and associated models support programmatic visualization, scientific illustration, front-end prototyping, and educational animation, enabling models to flexibly operate via text, code, and visual inputs and closing the gap between perceptual and symbolic code intelligence. Availability of code and checkpoints ensures accessibility for further open-source research and scalable extension for emergent multimodal applications.

JanusCode-800K marks a significant milestone in establishing large, systematically curated resources for visual-programmatic AI, fostering model development that symmetrically integrates language, logic, and perception (Sun et al., 27 Oct 2025).
