JanusCoder: Unified Multimodal Code System

Updated 3 November 2025
  • JanusCoder is a unified multimodal code intelligence system combining text, code, and visual inputs to generate and edit complex visual artifacts.
  • It employs transformer architectures and reward-modeling techniques on an 800K-sample dataset to ensure high task relevance and code quality.
  • The system supports cross-domain reasoning, enabling text-to-code, image-to-code, and interactive visual editing for varied applications.

JanusCoder is a foundational code intelligence system that integrates textual instructions, code, and visual modalities within a unified model architecture. Its design targets complex neural code intelligence scenarios—ranging from generating code for standard charts and interactive web UIs to program-driven visual editing and scientific animations—by establishing a visual-programmatic interface that accepts both text and vision as inputs and directly reasons about their programmatic logic and corresponding visual outputs.

1. Architectural Design and Unified Multimodal Pipeline

JanusCoder is based on large-scale LLMs (Qwen3, 8B/14B parameters), with extensions for vision (JanusCoderV uses InternVL3.5-8B and Qwen2.5-VL-7B-Instruct). The model supports arbitrary combinations of textual instructions, code snippets, and visual inputs (e.g., screenshots, images), integrating them into a single transformer stack for causal language modeling:

$$\mathcal{L} = \mathbb{E}_{(I, C, V)} \left[ -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I, V) \right]$$

where $I$ is the textual instruction, $C$ is the code, and $V$ is the visual input. The architecture is designed for unified inference on text-centric (instruction/code), vision-centric (visual/code), and multimodal input triplets.

JanusCoderV’s visual encoder processes images and outputs a token stream that is concatenated with instruction/code for the transformer; the model is trained to generate code conditioned on these composite contexts, directly enabling vision-to-code tasks (e.g., “chart mimicry” or UI recreation from screenshots).
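A minimal sketch of this training objective, assuming a HuggingFace-style decoder that accepts precomputed input embeddings (the module interfaces, tensor shapes, and masking below are illustrative assumptions, not the released training code):

```python
import torch
import torch.nn.functional as F

def multimodal_lm_loss(decoder, text_embeds, visual_embeds, code_embeds, code_ids):
    """Causal LM loss over code tokens, conditioned on instruction text and visual tokens.

    decoder       : causal transformer accepting inputs_embeds and returning .logits
    text_embeds   : (B, T_i, D) embedded instruction tokens (I)
    visual_embeds : (B, T_v, D) visual tokens from the vision encoder (V); T_v = 0 for text-only samples
    code_embeds   : (B, T_c, D) embedded target code tokens (teacher forcing)
    code_ids      : (B, T_c)    target code token ids (C), padding marked with -100
    """
    # Concatenate instruction, visual, and code streams into one causal sequence.
    inputs = torch.cat([text_embeds, visual_embeds, code_embeds], dim=1)
    logits = decoder(inputs_embeds=inputs).logits            # (B, T_i + T_v + T_c, vocab)

    prefix_len = text_embeds.size(1) + visual_embeds.size(1)
    # Token c_t is predicted from the position just before it; loss covers code positions only.
    code_logits = logits[:, prefix_len - 1:-1, :]             # (B, T_c, vocab)
    return F.cross_entropy(
        code_logits.reshape(-1, code_logits.size(-1)),
        code_ids.reshape(-1),
        ignore_index=-100,
    )
```

The key point is that the loss is computed only over code tokens, so the instruction and visual prefix act purely as conditioning context.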

2. Multimodal Data Synthesis Toolkit and JanusCode-800K

A cornerstone of JanusCoder is the large-scale synthesis toolkit for producing high-quality multimodal code data across domains:

  1. Data Sourcing: Aggregates data from public repositories (StackV2, WebCode2M), domain-focused archives (Wolfram Demonstrations, Manim scripts), and languages (Python, R, Mathematica, Matlab).
  2. Guided Evolution: Mutates seed samples (instruction, code, visual) with model-driven transformations—e.g., extending web UIs, modifying chart properties, adding widgets to animation code.
  3. Re-contextualization: Upgrades or clarifies instructions using model feedback for tighter code-instruction alignment.
  4. Reverse Instruction: Generates plausible instructions for code-only samples, increasing labeled coverage.
  5. Bidirectional Translation: Translates instructions/code/tasks across domains (e.g., chart/animation tasks between Manim, Mathematica, Matlab, etc.), enabling cross-system generalization.
  6. Automated Quality Control: Each sample is evaluated using a reward model—incorporating task relevance, code quality, completion, visual clarity:

$$S = R(I, C, V) = \frac{1}{4} \left( \text{Task Relevance} + \text{Completion} + \text{Code Quality} + \text{Visual Clarity} \right)$$

Only high-scoring samples are retained for training.
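A minimal sketch of this filtering stage, assuming a judge model that returns the four criteria as normalized scores (the `judge` callable, field names, and threshold are illustrative assumptions, not the paper's exact setup):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

@dataclass
class Sample:
    instruction: str   # I: textual instruction
    code: str          # C: program text
    rendering: bytes   # V: rendered visual output (e.g., a PNG of the chart or UI)

def quality_score(sample: Sample, judge: Callable[[Sample], Dict[str, float]]) -> float:
    """S = R(I, C, V): average of the four reward-model criteria, each assumed in [0, 1]."""
    scores = judge(sample)  # e.g., {"task_relevance": 0.9, "completion": 1.0, ...}
    return (
        scores["task_relevance"]
        + scores["completion"]
        + scores["code_quality"]
        + scores["visual_clarity"]
    ) / 4.0

def filter_corpus(samples: Iterable[Sample], judge, threshold: float = 0.8) -> List[Sample]:
    """Retain only high-scoring (instruction, code, visual) triplets for training."""
    return [s for s in samples if quality_score(s, judge) >= threshold]
```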

This synthesis pipeline yields JanusCode-800K, an 800,000-sample multimodal dataset with balanced coverage:

  • Python visualization: 180K samples
  • WebUI: >270K samples
  • Chart-to-code: 70K samples
  • Animation: 19.5K samples
  • Scientific PLs: 31.8K samples
  • Algorithms: >100K samples

JanusCoder is trained on text-centric subsets; JanusCoderV trains on the full multimodal dataset.

3. Visual-Programmatic Interface and Task Coverage

Unique to JanusCoder is its generalist, scalable handling of coding tasks involving both program logic and visual semantics:

  • Text-to-code: Generate Python, R, or web code for visual artifact creation (charts/plots, UIs, dynamic animations) based on textual instructions.
  • Image-to-code (vision-centric): Given a visual artifact (e.g., chart image, UI screenshot), generate the corresponding source code to reproduce it.
  • Visual editing: Modify program output based on visual cues (e.g., “change all buttons to blue” in a screenshot-driven web code editing task).
  • Cross-domain synthesis: Transfer logic from covered domains (e.g., abstract algorithmic tasks in R/Matlab to Manim/Wolfram for animation).
  • Interactive science demonstration: Synthesis and editing in scientific visualization contexts (Manim, Mathematica).
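These modes differ mainly in which modalities appear in the prompt. A schematic, purely illustrative way to express the three request types as chat-style payloads (the message schema and content keys are assumptions, not the project's API):

```python
def text_to_code(instruction: str) -> list:
    # Text-centric: the instruction alone, e.g., "Plot monthly revenue as a bar chart in Python."
    return [{"role": "user", "content": [{"type": "text", "text": instruction}]}]

def image_to_code(instruction: str, image_path: str) -> list:
    # Vision-centric: reproduce the artifact shown in the image (chart mimicry, UI recreation).
    return [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": instruction},
    ]}]

def visual_edit(instruction: str, screenshot_path: str, source_code: str) -> list:
    # Multimodal editing: screenshot + existing code + an edit request ("change all buttons to blue").
    return [{"role": "user", "content": [
        {"type": "image", "image": screenshot_path},
        {"type": "text", "text": instruction + "\n\nCurrent code:\n" + source_code},
    ]}]
```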

JanusCoder's reward-modeled data pipeline and cross-domain strategy distinguish it from specialist models built only for chart-to-code, UI editing, or scientific animation tasks; they enable transfer and generalization in subdomains otherwise limited by data scarcity.

4. Experimental Results and Evaluation

JanusCoder (8B and 14B parameters) and the vision-capable JanusCoderV variants (7B and 8B) are benchmarked on diverse, high-complexity multimodal code intelligence datasets:

  • Text-to-code (PandasPlotBench, ArtifactsBench, DTVBench):
    • JanusCoder-14B attains an error rate of 9.7% on PandasPlotBench, matching GPT-4o and outperforming other open-weight models.
    • Visual correctness and instruction alignment scores are superior to non-vision baselines on ArtifactsBench.
  • Vision-centric/image-to-code (ChartMimic, WebCode2M, InteractScience, DesignBench):

    • JanusCoderV-7B/8B exceeds GPT-4o and chart-to-code specialists on low- and high-level performance metrics in ChartMimic.
    • On WebCode2M, JanusCoderV-7B reaches the highest TreeBLEU (structural similarity of DOM trees):

    $$\text{TreeBLEU} = \frac{|S(t) \cap S(\hat{t})|}{|S(\hat{t})|}$$

    where $S(\cdot)$ denotes the set of subtrees, $t$ the predicted tree, and $\hat{t}$ the reference (a small computation sketch follows the results table below).
    • InteractScience benchmarks show JanusCoderV outperforming all open baselines for programmatic and visual correctness.

  • Ablations: Data category removal or lack of reward modeling (vs. executability-only filtering) significantly reduces scores, confirming the importance of multimodal synergies and quality modeling.
  • Model transfer robustness: Applying the JanusCode-800K pipeline to weaker LLM backbones (Qwen2.5-Coder, InternVL3.5-4B) results in improved generalization, suggesting robustness to backbone selection.
| Model | Error Rate (%) | Visual Score | Task Score |
| --- | --- | --- | --- |
| Qwen3-8B | 20.0 | 63 | 74 |
| JanusCoder-8B | 14.9 | 63 | 80 |
| JanusCoder-14B | 9.7 | 67 | 86 |
| GPT-4o (proprietary) | 9.7 | 72 | 85 |
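To make the TreeBLEU metric above concrete, here is a minimal sketch that treats DOM trees as nested (tag, children) tuples and matches height-1 subtrees; the exact subtree definition used by WebCode2M is an assumption, not quoted from the benchmark:

```python
from collections import Counter

def subtrees(node):
    """Collect height-1 subtrees (a tag plus its ordered child tags) of a DOM-like tree.

    Trees are ("tag", [children]) tuples, e.g., ("div", [("button", []), ("span", [])]).
    """
    tag, children = node
    bag = Counter()
    if children:
        bag[(tag, tuple(child[0] for child in children))] += 1
    for child in children:
        bag += subtrees(child)
    return bag

def tree_bleu(predicted, reference):
    """|S(t) ∩ S(t_hat)| / |S(t_hat)| over the subtree multisets."""
    pred_bag, ref_bag = subtrees(predicted), subtrees(reference)
    if not ref_bag:
        return 0.0
    return sum((pred_bag & ref_bag).values()) / sum(ref_bag.values())

# Example: the prediction recovers one of the two reference subtrees -> TreeBLEU = 0.5.
reference = ("body", [("div", [("button", []), ("button", [])]), ("p", [])])
predicted = ("body", [("div", [("button", [])]), ("p", [])])
print(tree_bleu(predicted, reference))
```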

5. Core Methodological Insights

Key findings from JanusCoder research include:

  • Cross-domain transfer (e.g., R/Matlab logic applied to Manim/Wolfram) is required for task generalization when labeled data is sparse.
  • Reward modeling—explicit rating of samples for clarity, correctness, and task relevance—yields significantly higher model performance than filtering solely by executable output.
  • Unified, multimodal architecture prevents compartmentalization and scaling issues inherent to domain-specialist models, facilitating cross-task and cross-domain reasoning.
  • AST-based structured learning: Ingesting large code files via abstract syntax tree decomposition enables granular, annotated learning from complex scripts, further facilitating multi-step orchestration and editing tasks.
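Following the AST-based learning point above, a minimal illustration using Python's standard ast module to split a long script into annotated, independently usable units (the unit granularity and metadata fields are assumptions, not the paper's exact scheme):

```python
import ast

def decompose_script(source: str):
    """Split a large code file into top-level units (imports, functions, classes, statements),
    each paired with its docstring and exact source text so it can be annotated independently."""
    tree = ast.parse(source)
    units = []
    for node in tree.body:
        named = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        units.append({
            "kind": type(node).__name__,
            "name": node.name if named else None,
            "doc": ast.get_docstring(node) if named else None,
            "code": ast.get_source_segment(source, node),
        })
    return units

script = '''
import matplotlib.pyplot as plt

def plot_sales(df):
    """Bar chart of monthly sales."""
    df.plot.bar(x="month", y="sales")
    plt.show()
'''

for unit in decompose_script(script):
    print(unit["kind"], unit["name"])   # Import None / FunctionDef plot_sales
```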

6. Practical Impact and Significance

JanusCoder and JanusCoderV deliver open-source, foundational models for multimodal code intelligence, directly enabling programmatic generation and editing for visual artifacts in a broad spectrum of scientific and creative coding domains. The models’ capacity to harmonize instructional logic, code correctness, and visual fidelity—and their robust generalization across domains, modalities, and backbones—marks a significant advance in neural code intelligence. All code, model checkpoints, and corpus data are publicly available (https://github.com/InternLM/JanusCoder).

JanusCoder's approach demonstrates that reward-modeled, large-scale, cross-domain multimodal data, paired with scalable backbone LLMs and capable vision encoders, is essential for closing the gap between programmatic logic and its visual expression in code intelligence.
