
JanusCode-800K: Cross-Modal Synthetic Dataset

Updated 21 January 2026
  • JanusCode-800K is a large-scale, high-fidelity synthetic dataset comprising around 800,000 curated examples covering both text-centric and vision-centric code tasks.
  • It leverages guided evolution, reverse instruction, and bidirectional translation to generate diverse, validated samples, ensuring robust performance in code generation and visual-programmatic interface tasks.
  • Empirical evaluations reveal significant improvements in benchmark metrics and model transferability, highlighting its role in advancing both unimodal and multimodal model training.

JanusCode-800K is a large-scale, high-fidelity synthetic dataset designed to advance machine learning research in both text-centric and multimodal (vision-centric) code intelligence. Comprising approximately 800,000 curated examples, it provides comprehensive coverage across natural language instructions, algorithmic reasoning, executable code solutions, and, where applicable, visual artifacts derived from code. This dataset underpins state-of-the-art model training in code generation, multimodal program synthesis, and visual-programmatic interface tasks, enabling significant improvements in benchmark performance and architectural transferability (Abed et al., 27 Oct 2025, Sun et al., 27 Oct 2025).

1. Dataset Composition and Organization

JanusCode-800K contains a balanced mixture of text-centric and vision-centric samples, enabling broad coverage of coding and visual tasks. The dataset is split into approximately 407,400 text-centric examples and 392,500 vision-centric examples for a total of roughly 800,000 entries.

Text-centric modalities (sample counts in parentheses) focus on:

  • Python visualization: code generation (127,500) and code editing (51,800)
  • General algorithmic code (100,000)
  • Scientific programming languages (R, Matlab, Mathematica; 31,800)
  • SVG graphics (20,000)
  • Code-driven animations (19,500)
  • Miscellaneous code artifacts (56,800)

Vision-centric modalities encompass:

  • Chart-to-code tasks (70,000)
  • Web UI generation (200,000) and editing (69,500)
  • Scientific demonstration visualizations (53,000)

All samples are classified along two orthogonal axes:

  • Task orientation: text-centric vs. vision-centric
  • Content domain: plots, animations, UIs, scientific demos

This taxonomy ensures domain coverage from basic plotting (Matplotlib) to complex Manim-style animations and interactive web interfaces (Sun et al., 27 Oct 2025).
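
As an illustrative stand-in (not an actual dataset entry), a sample in the simplest Python-visualization category could pair an instruction with a short Matplotlib script whose execution yields the visual artifact:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, as in sandboxed execution
import matplotlib.pyplot as plt

# Instruction (illustrative): "Plot y = x^2 for x in [0, 10] and save it as a PNG."
xs = list(range(11))
plt.plot(xs, [x ** 2 for x in xs], marker="o")
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("Quadratic growth")
plt.savefig("artifact.png")  # the saved image is the sample's visual artifact
```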

2. Synthetic Data Generation and Curation Pipeline

The data generation methodology for JanusCode-800K integrates multiple strategies to ensure sample diversity, domain coverage, and alignment with human problem-solving.

2.1 Data Sources and Structuring

  • Seed problems are collected from competitive programming (e.g., LeetCode, Codeforces, AtCoder) and open corpora (StackV2, WebCode2M, scientific demonstration repositories, GitHub Manim scripts).
  • Large code files are decomposed via AST parsing to extract semantic units for granular sampling.
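
A minimal sketch of the AST-based decomposition, assuming top-level functions and classes are the semantic units (the exact granularity is not specified in the papers):

```python
import ast

def extract_semantic_units(source: str) -> list[str]:
    """Split a large code file into top-level function/class definitions."""
    tree = ast.parse(source)
    units = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)  # exact text span of the node
            if segment is not None:
                units.append(segment)
    return units

example = "def f(x):\n    return x + 1\n\nclass C:\n    pass\n"
print(extract_semantic_units(example))  # two units: f and C
```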

2.2 Multi-Strategy Sample Synthesis

  • Guided Evolution: Seed triplets (Instruction, Code, Visual) are evolved by injecting high-level concepts (e.g., "add interactive slider") and re-validating in a sandboxed environment until correctness is ensured.
  • Re-contextualization: For (Instruction, Code) pairs, implicit problem logic is extracted to generate richer instructions that fully specify intent.
  • Reverse Instruction: Code snippets are sampled and instructions are induced, triggering regeneration of code and validation.
  • Bidirectional Translation: Cross-domain translation (e.g., Mathematica ↔ Manim) produces synthetic parallel pairs, aiding cross-modal generalization.
  • Pattern-Guided Expansion [Editor’s term]: Accepted web-mined content is expanded by seeding reasoning pattern prompts (e.g., sliding window, divide-and-conquer) to instructor models, ensuring broad algorithmic motif coverage (Abed et al., 27 Oct 2025).
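
As a concrete sketch of the guided-evolution strategy above, the loop below injects a high-level concept, regenerates, and keeps only validated candidates; generate_with_llm and the sandbox check are placeholder stubs, since neither paper specifies these interfaces at the API level:

```python
import random

# Hypothetical high-level concepts injected during evolution.
CONCEPTS = [
    "add an interactive slider",
    "use a log-scaled axis",
    "animate the transition between states",
]

def generate_with_llm(prompt: str, seed_code: str) -> str:
    """Placeholder for the instructor-model call (not an API from the papers)."""
    return seed_code  # stub: a real pipeline would regenerate code here

def runs_in_sandbox(code: str) -> bool:
    """Placeholder for sandboxed execution; here reduced to a compile check."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def evolve_sample(instruction: str, code: str, max_rounds: int = 3):
    """Inject a concept, regenerate code, and keep only candidates that validate."""
    for _ in range(max_rounds):
        concept = random.choice(CONCEPTS)
        new_instruction = f"{instruction} Additionally, {concept}."
        new_code = generate_with_llm(new_instruction, seed_code=code)
        if runs_in_sandbox(new_code):
            return new_instruction, new_code
    return None  # discard seeds that never produce a validated candidate

print(evolve_sample("Plot a sine wave.", "import math\nprint(math.sin(1.0))"))
```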

2.3 Multi-Stage Validation and Mutation

  • Each candidate code sample is executed in sandboxed containers to verify test set conformance. Fitness for code is defined as the proportion of unit tests passed.
  • For vision-centric samples, successfully executing code must also yield the expected visual artifact.
  • Reward modeling with a vision-language model (VLM) judge scores relevance, completion, code quality, and visual clarity on a 1–5 scale. Only samples with aggregate score S ≥ τ (typically τ = 4.0) are retained (Sun et al., 27 Oct 2025).
  • A genetic algorithm introduces new task diversity via crossover (instruction recombination) and mutation (constraint tightening, input modification) at rate μ = 0.1, with periodic seed reinsertion and semantic deduplication (MiniLM, FAISS, Gemma-3) to maximize diversity.
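
A condensed sketch of this validate-score-filter gate, assuming a plain mean over the four judge axes (the papers do not state the exact aggregation):

```python
from statistics import mean

TAU = 4.0  # retention threshold on the aggregate judge score

def fitness(test_results: list[bool]) -> float:
    """Fitness of a code sample: the proportion of unit tests passed."""
    return sum(test_results) / len(test_results)

def vlm_score(axes: dict[str, float]) -> float:
    """Aggregate the 1-5 judge scores; a plain mean is assumed here."""
    keys = ("relevance", "completion", "code_quality", "visual_clarity")
    return mean(axes[k] for k in keys)

def retain(test_results: list[bool], axes: dict[str, float]) -> bool:
    """Keep a sample only if all tests pass and the judge score clears TAU."""
    return fitness(test_results) == 1.0 and vlm_score(axes) >= TAU

print(retain([True, True, True],
             {"relevance": 5, "completion": 4,
              "code_quality": 4, "visual_clarity": 4}))  # True (mean 4.25)
```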

3. Quadruplet Structure and Example

Each sample in JanusCode-800K for text-centric tasks is a quadruplet:

  • Instruction: natural-language task prompt (e.g., "Given an array of integers, return the largest sum of any contiguous subarray.")
  • Reasoning: stepwise chain-of-thought solution (e.g., "1. Initialize max_current... 2. For each num... 3. ...")
  • Solution: fully functional Python code (e.g., def max_subarray(nums): ...)
  • Tests: executable unit tests (e.g., import unittest ...)
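
A runnable rendering of the Solution and Tests fields above (the function body is a standard Kadane implementation supplied for illustration; the dataset excerpt shows only the signature):

```python
def max_subarray(nums):
    """Return the largest sum of any contiguous subarray (Kadane's algorithm)."""
    max_current = max_global = nums[0]
    for num in nums[1:]:
        max_current = max(num, max_current + num)  # extend or restart the window
        max_global = max(max_global, max_current)
    return max_global

import unittest

class TestMaxSubarray(unittest.TestCase):
    def test_mixed_signs(self):
        self.assertEqual(max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]), 6)

    def test_all_negative(self):
        self.assertEqual(max_subarray([-3, -1, -2]), -1)

if __name__ == "__main__":
    unittest.main()
```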

For vision-centric examples, a visual artifact is also included as a first-class data field (e.g., plot, UI screenshot) (Sun et al., 27 Oct 2025).

This formalism directly scaffolds LLMs to learn both the solution and the underlying rationale for each task, a departure from datasets that record only input/output pairs.

4. Quality Control and Filtering

JanusCode-800K employs multiple, interlocking quality assessment mechanisms:

  • Code validation: Only code that runs and passes all designated tests is retained (fitness f(c) = 1).
  • Visual artifact validation: For (Instruction, Code, Visual) triples, visual outputs must match expected results.
  • Vision-LLM (VLM) rewards: Four axes (task relevance, completion, code quality, visual clarity), with an aggregate threshold of S ≥ 4.0.
  • Deduplication: Pairwise semantic similarity is kept below 0.9 to maximize unique coverage (see the sketch after this list).
  • Downstream validation metrics: ChartMimic (low/high-level matching), TreeBLEU for Web UI structure, DTVBench for animation faithfulness and alignment.
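
A minimal sketch of the deduplication criterion, using greedy filtering over MiniLM sentence embeddings (the model name below is an assumption; at 800K-sample scale the pipeline reportedly relies on FAISS for approximate search rather than this brute-force loop):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

THRESHOLD = 0.9  # pairwise similarity cap from the filtering spec

def deduplicate(instructions: list[str]) -> list[str]:
    """Keep an item only if its cosine similarity to every kept item is < THRESHOLD."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed MiniLM variant
    emb = model.encode(instructions, normalize_embeddings=True)
    kept_idx: list[int] = []
    for i in range(len(instructions)):
        # Normalized embeddings make the dot product equal to cosine similarity.
        if all(float(np.dot(emb[i], emb[j])) < THRESHOLD for j in kept_idx):
            kept_idx.append(i)
    return [instructions[i] for i in kept_idx]

print(deduplicate(["Reverse a linked list.",
                   "Reverse a singly linked list.",   # near-duplicate, likely dropped
                   "Compute the n-th Fibonacci number."]))
```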

Ablation experiments show that reward-based filtering and data variety both contribute substantially to downstream benchmark performance, with removal of any major domain reducing accuracy by several percentage points (Sun et al., 27 Oct 2025).

5. Empirical Results and Model Training Outcomes

JanusCode-800K shows strong empirical impact on LLMs for code and multimodal code+vision tasks.

  • Fine-tuning Phi-2 (2.7B) on 25k JanusCode-800K samples yields a HumanEval (Base) pass@1 of 56.1% (+10.4 percentage points over the base model), matching or exceeding much larger models (e.g., CodeLlama-70B, Llama3-8B-Instruct) under identical sample budgets.
  • Generalization: Gains transfer similarly to other architectures (e.g., CodeGemma-2B, Qwen2.5-Coder-7B, InternVL3.5-4B).
  • Preserved reasoning: Domain-specific fine-tuning leaves broader reasoning benchmarks (HellaSwag, WinoGrande, MMLU) essentially unchanged, suggesting robust specialization does not reduce general capability.
  • Multimodal tasks: JanusCoderV-7B achieves 64.7%/72.8% (low-/high-level) on ChartMimic, outperforming Qwen2.5-VL-7B, and reaches a TreeBLEU of 0.28 on web UI structure.
  • Dynamic visualization: Trained with JanusCode-800K, JanusCoder-8B attains 9.70/15 on DTVBench (Manim), exceeding Qwen3-14B and GPT-4o on subsampled Wolfram tasks.
  • Ablations: Removal of specific data domains or reward filtering uniformly reduces performance across metrics.

6. Broader Implications and Design Principles

  • Substituting for scaling: Structured, reasoning-integrated synthetic datasets can yield gains equivalent to large increases in model scale; for instance, a 2.7B model fine-tuned on JanusCode-800K performs on par with, and occasionally surpasses, 30B+ models.
  • Diversity: Sample diversity, ensured via web mining, mutation, and semantic deduplication, is critical: a homogeneous 5k subset underperforms a diverse subset by 6–7 percentage points on HumanEval.
  • Transferability: The construction framework is model-agnostic; porting it to other languages requires only substituting the runtime and test-harness components (e.g., JUnit for Java, Google Test for C++).
  • Multimodal synergy: Reciprocal transfer between text and vision domains (e.g., using Python plots to seed chart-to-code data) expands coverage efficiently and bolsters generalization capacity for LLMs tasked with cross-modal generation.

These findings suggest that reasoning-centered, cross-modal synthetic corpora like JanusCode-800K afford a scalable, robust foundation for advancing both unimodal code intelligence and multimodal, vision-conditioned code synthesis (Abed et al., 27 Oct 2025, Sun et al., 27 Oct 2025).
