JanusCode-800K Multimodal Code Corpus
- JanusCode-800K is a comprehensive multimodal dataset merging instructional text, executable code, and rendered visuals for unified model training.
- It uses a hierarchical synthesis and quality control pipeline to ensure high code execution fidelity, visual clarity, and accurate instruction alignment.
- The corpus underpins models like JanusCoder and JanusCoderV, demonstrating improved performance in code structure, visual fidelity, and cross-domain generalization.
JanusCode-800K is a large-scale, multimodal code corpus designed to enable the training and benchmarking of unified visual-programmatic models. Comprising over 800,000 samples, JanusCode-800K addresses acute bottlenecks in scalable data synthesis, quality assessment, and cross-domain coverage for tasks at the intersection of code intelligence and visual reasoning. It underpins the development of JanusCoder and JanusCoderV, models that extend code generation and understanding to both text-centric and vision-centric inputs, supporting comprehensive evaluation across instructional, visual, and programmatic axes.
1. Corpus Structure and Coverage
JanusCode-800K constitutes a broad collection of multimodal samples that systematically merge instructional text, executable source code, and rendered visual output. The construction spans a diverse programmatic landscape, including classic charts and visualizations, interactive web UIs, code-driven animations, scientific demonstrations, and algorithmic artifacts. Distinct modalities are organized as follows:
| Data Type | # Samples |
|---|---|
| Python Visualization: Generation | 127.5K |
| Python Visualization: Editing | 51.8K |
| SVG | 20.0K |
| Animation | 19.5K |
| General Artifacts | 56.8K |
| Algorithm Data | 100.0K |
| Scientific PLs | 31.8K |
| Chart-to-Code | 70.0K |
| WebUI Generation | 200.0K |
| WebUI Editing | 69.5K |
| Scientific Demonstration | 53.0K |
Both text-centric tasks (e.g., instruction-to-code) and vision-centric tasks (e.g., chart-to-code and webpage editing) are robustly represented. This distribution promotes balanced training for unified code intelligence models that must reason across modalities.
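The per-modality counts above can be tallied to sanity-check the corpus-size claim (a quick arithmetic check; the dictionary keys are my own shorthand for the table's row labels):

```python
# Per-modality sample counts in thousands, copied from the corpus table.
COUNTS_K = {
    "python_viz_generation": 127.5,
    "python_viz_editing": 51.8,
    "svg": 20.0,
    "animation": 19.5,
    "general_artifacts": 56.8,
    "algorithm_data": 100.0,
    "scientific_pls": 31.8,
    "chart_to_code": 70.0,
    "webui_generation": 200.0,
    "webui_editing": 69.5,
    "scientific_demonstration": 53.0,
}

total_k = sum(COUNTS_K.values())
print(f"{total_k:.1f}K")  # ~799.9K, i.e. roughly 800K samples in total
```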
2. Modalities: Data Types, Tasks, and Programming Languages
JanusCode-800K provides comprehensive support for a range of input and output modalities across three axes:
Visual Outputs include static charts (e.g., Matplotlib, Seaborn), SVG vector graphics, Manim/3Blue1Brown-style animations, interactive web applications (HTML, CSS, JavaScript), scientific demonstrations (Wolfram Mathematica, R, Matlab), and general artifacts such as games and management systems.
Code Modalities span Python (emphasizing data and visualization), scientific languages (Matlab, Wolfram Language, R), web technologies (HTML, CSS, JavaScript), SVG, and domain-specific animation scripts (Manim).
Supported Tasks encompass instruction-to-code, chart-to-code, webpage editing/generation (including image/screen-to-code), algorithmic problem solving, scientific code translation, and bidirectional translation between domains and languages.
This breadth ensures JanusCode-800K supports not only standard code-generation and reasoning tasks but also complex, visually entangled tasks required for effective visual-programmatic interfaces.
3. Synthesis Toolkit and Data Curation Pipeline
The JanusCode-800K corpus is synthesized using a hierarchical toolkit designed to maximize data quality and coverage. The synthesis pipeline incorporates:
- Data Collection: Aggregation from large-scale public datasets (e.g., StackV2, WebCode2M), web crawls, and specialized sources such as Wolfram Demonstrations.
- Data Curation: Multifaceted strategies transform and extend raw samples:
  - Guided Evolution promotes diversity via meta-task-driven mutation of instruction/code pairs.
  - Re-contextualization improves instruction-code alignment and increases semantic richness.
  - Reverse Instruction generates natural language instructions for code-only assets.
  - Bidirectional Translation amplifies cross-domain and cross-lingual coverage.
- Quality Control: Samples undergo:
  - Automated code execution within sandboxed environments corresponding to their language and runtime.
  - LLM or Vision LLM (VLM)-based multi-objective reward modeling and scoring, with each dimension (task relevance, completeness, code quality, visual clarity) scored 1–5.
  - Only samples achieving the score threshold $S(x_I, x_C, x_V) \ge \tau$ (where $x_I$ is the instruction, $x_C$ the code, and $x_V$ the visual output) are retained.
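The reward-model gate described above can be sketched as a simple filter over per-dimension judge scores. The 4.0 cutoff and the dictionary keys are illustrative assumptions, not the paper's exact threshold:

```python
# Judged dimensions from the quality-control step, each scored 1-5
# by an LLM/VLM reward model.
DIMENSIONS = ("task_relevance", "completeness", "code_quality", "visual_clarity")

def passes_quality_gate(scores: dict, threshold: float = 4.0) -> bool:
    """Retain an (instruction, code, visual) triple only if every judged
    dimension reaches the threshold.  The 4.0 default is an assumed value
    for illustration; the actual cutoff is set by the pipeline."""
    return all(scores.get(dim, 0) >= threshold for dim in DIMENSIONS)
```

A sample scoring 5/4/4/5 across the four dimensions would pass this gate, while any single dimension below the cutoff rejects the triple.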
Webpage structure alignment is specifically evaluated using TreeBLEU: $\text{TreeBLEU} = \frac{|S(t) \cap S(\hat{t})|}{|S(\hat{t})|}$, with $S(\cdot)$ denoting the set of 1-height subtrees and $t$, $\hat{t}$ the candidate and reference parse trees.
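A minimal TreeBLEU sketch over HTML using only the standard library; it ignores attributes and void elements, so treat it as illustrative rather than the benchmark's reference implementation:

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree from an HTML string."""
    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def height1_subtrees(node, out=None):
    """Collect each node's (tag, child-tag-tuple) pairs: the 1-height subtrees."""
    if out is None:
        out = set()
    tag, children = node
    if children:
        out.add((tag, tuple(child[0] for child in children)))
    for child in children:
        height1_subtrees(child, out)
    return out

def tree_bleu(candidate_html: str, reference_html: str) -> float:
    """Fraction of the reference's 1-height subtrees matched by the candidate."""
    cand, ref = TreeBuilder(), TreeBuilder()
    cand.feed(candidate_html)
    ref.feed(reference_html)
    s_cand = height1_subtrees(cand.root)
    s_ref = height1_subtrees(ref.root)
    if not s_ref:
        return 0.0
    return len(s_cand & s_ref) / len(s_ref)
```

For example, a candidate page missing one of two sibling elements under a `<div>` scores 0.5 against the reference, since it recovers only one of the two reference subtrees.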
Editor’s term: Multi-strategy curation refers to the use of overlapping synthesis, translation, and validation steps for increased coverage and robustness.
4. Data Generation, Cross-Modal Synergy, and Quality Assessment
Synthesis strategies are chosen per domain and combined to maximize sample diversity and task coverage. Strategies are as follows:
| Domain | Synthesis Strategies |
|---|---|
| Visualization | Guided Evolution, Reverse Instruction, Re-contextualization |
| Animation/Artifacts | Guided Evolution, Bidirectional Translation |
| WebUI | Guided Evolution, Re-contextualization, Reverse Instruction |
| Scientific Demonstrations | Bidirectional Translation, Guided Evolution |
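The per-domain strategy table can be mirrored as a small dispatcher; the domain keys and the fallback strategy are my own shorthand for illustration:

```python
# Domain -> ordered synthesis strategies, mirroring the table above.
STRATEGIES = {
    "visualization": ["guided_evolution", "reverse_instruction", "re_contextualization"],
    "animation_artifacts": ["guided_evolution", "bidirectional_translation"],
    "webui": ["guided_evolution", "re_contextualization", "reverse_instruction"],
    "scientific_demos": ["bidirectional_translation", "guided_evolution"],
}

def plan_synthesis(domain: str) -> list:
    """Return the strategy list for a domain; the default for unlisted
    domains is an assumption, not part of the paper's pipeline."""
    return STRATEGIES.get(domain, ["guided_evolution"])
```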
Validation encompasses:
- Sandbox execution of every code sample in contextually matched environments (e.g., Python for visualization, Manim for animation, Mathematica for demonstrations, Playwright for web rendering).
- VLM-based chain-of-thought reward modeling to assess alignment between instruction, code, and rendered visual.
- Filtering via the multidimensional score detailed above.
- Automated and, for some tasks (e.g., DTVBench), human annotation applying explicit rubrics for subjective benchmarks.
Cross-domain synergy leverages translation to expand and bridge coverage—e.g., converting scientific code (R, Matlab) to Manim or Wolfram Language animations, and cross-pollinating UI and SVG data.
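The sandbox-execution step can be sketched for the Python case with a subprocess runner. A real pipeline would use containerized, per-language runtimes (Manim, Mathematica, Playwright) as listed above; this is a minimal, unhardened stand-in:

```python
import os
import subprocess
import sys
import tempfile

def sandbox_execute(code: str, timeout: int = 30):
    """Run a Python code sample in a child process and report whether it
    executes cleanly.  Returns (success, stderr_or_reason).  This offers
    process isolation only, not a true security sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)
```

Samples whose code raises an exception or exceeds the timeout are rejected before reward-model scoring is applied.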
5. Model Training, Evaluation Protocols, and Benchmarks
JanusCode-800K powers the training of JanusCoder (text-centric) and JanusCoderV (fully multimodal) models. Backbone models (Qwen3, Qwen2.5-VL, InternVL3.5) are adapted to handle the corpus’s diverse input formats. The suite of evaluative tasks spans:
- Visualization Generation (PandasPlotBench, ArtifactsBench)
- Chart-to-Code/ChartMimic
- Webpage Generation/Editing (WebCode2M, DesignBench)
- Algorithmic Reasoning
- Dynamic Theorem Visualization (DTVBench)
- SVG/Vector Graphics Generation
- Scientific Demonstrations (InteractScience)
- General Code Problem Solving (BigCodeBench, LiveCodeBench)
Performance is assessed using granular metrics:
- Text-centric: error rate (percentage of samples producing incorrect code) and mean task and visual scores on a 1–100 scale.
- Multimodal: Executability, visual fidelity (CLIP similarity), code structure alignment (TreeBLEU), programmatic correctness (pass rates), VLM/human scoring.
- Animation/Theorem Visualization: a Total Score combining executability, code similarity, instruction alignment, and human-rated faithfulness.
6. Empirical Findings and Implications
Key results reveal that JanusCoder and JanusCoderV, trained on JanusCode-800K, surpass commercial baselines (e.g., GPT-4o, Claude, Gemini) in code structure correctness (TreeBLEU), visual fidelity, and instruction alignment. A plausible implication is that data synergy across programmatic and visual domains enhances cross-domain generalization; for instance, scientific PL data strengthens animation performance.
Reward modeling is identified as essential for quality assurance—simple execution checks are insufficient for verifying visual correctness. Models trained on JanusCode-800K exhibit broader, more balanced capabilities relative to both specialized (VisCoder) and general models (GPT-4o), a finding that generalizes across model backbones and parameter counts. This suggests the corpus’s contribution as a backbone-independent resource for future multimodal code intelligence research.
7. Broader Significance and Applications
JanusCode-800K constitutes a foundational dataset for unified code+vision AI, addressing the scarcity of large, high-quality multimodal code corpora. Its synthesis toolkit, systematic quality validation, and modality integration mark a distinctive advance over prior specialized or text-centric datasets. The corpus and associated models support programmatic visualization, scientific illustration, front-end prototyping, and educational animation, enabling models to flexibly operate via text, code, and visual inputs and closing the gap between perceptual and symbolic code intelligence. Availability of code and checkpoints ensures accessibility for further open-source research and scalable extension for emergent multimodal applications.
JanusCode-800K marks a significant milestone in establishing large, systematically curated resources for visual-programmatic AI, fostering model development that symmetrically integrates language, logic, and perception (Sun et al., 27 Oct 2025).