Code2Video: Code-Driven Educational Videos

Updated 7 October 2025
  • Code2Video is a code-centric paradigm that generates educational videos using executable code to deliver discipline-specific content with high fidelity.
  • It employs a tri-agent architecture—Planner, Coder, and Critic—to structure content, synthesize and auto-fix code, and refine visual layouts.
  • The framework demonstrates a 40% efficiency gain over direct code generation and strong performance on metrics like VLM-as-a-Judge and TeachQuiz.

Code2Video is a code-centric paradigm designed for generating educational videos via executable code, with an emphasis on discipline-specific knowledge transmission, precise visual structure, and coherent transitions. The approach rests on the concept that manipulating a renderable environment through logically controlled, executable commands (e.g., Python code for Manim) offers superior fidelity and interpretability compared to pixel-space generative models. The framework is modular, consisting of three cooperating agents—Planner, Coder, and Critic—that orchestrate structured content generation, robust code synthesis with auto-fix, and rigorous visual layout refinement. Evaluation on the MMMC benchmark demonstrates a quantitative advance over direct code-generation approaches, supported by a dedicated set of metrics including VLM-as-a-Judge scores and the novel TeachQuiz knowledge reacquisition test.
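For concreteness, the renderable environment being manipulated is an ordinary Manim program. Below is a minimal hand-written scene of the kind such a framework emits; the topic and content are illustrative, not drawn from the paper:

```python
from manim import Scene, Text, MathTex, Write, FadeIn, UP

class BinarySearchIntro(Scene):
    """A single hand-written section of the sort of program a Coder agent might emit."""
    def construct(self):
        title = Text("Binary Search").to_edge(UP)   # pin the title to the top edge
        claim = MathTex(r"T(n) = O(\log n)")        # centered key takeaway
        self.play(Write(title))                     # animate the title stroke by stroke
        self.play(FadeIn(claim))                    # fade in the complexity claim
        self.wait(2)                                # hold the frame for two seconds
```

Because every visual element is placed by explicit code, layout, timing, and content are all inspectable and reproducible, which is the fidelity argument above.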

1. Tri-Agent Framework Architecture

The Code2Video architecture comprises three specialized agents:

  • Planner: Interprets the initial learning request, decomposes it into a temporally coherent outline $\mathcal{O} = \{o_1, \ldots, o_n\}$, and constructs a storyboard for each section, including detailed lecture lines and associated diagrammatic and animation cues. It retrieves relevant visual assets from an external database $\mathcal{D}$, caching selections for reuse and curricular consistency. The process is formalized as $\mathcal{O} \leftarrow P_{\text{outline}}(\mathcal{Q})$ and $a_i \leftarrow P_{\text{asset}}(s_i)$.
  • Coder: Converts the Planner's storyboard into executable Python code using Manim syntax. Code synthesis is partitioned by section ($c_i = P_{\text{coder}}(s_i, \mathcal{A})$), enabling parallelization. The key innovation is the ScopeRefine auto-fix procedure:
    • Line scope: Attempts up to $K_1$ local repairs using the error line ±1 lines as context.
    • Block scope: Updates a broader context block $\mathcal{B}_{i,j}$, with up to $K_2$ repairs.
    • Global scope: When local fixes fail, the entire section code $c_i$ is regenerated from its storyboard $s_i$.
    • This hierarchical repair minimizes token usage and latency, yielding a 40% efficiency gain over direct code generation.
  • Critic: Post-processes the rendered video by applying a vision-LLM (VLM) conditioned on a Visual Anchor Prompt—a spatial grid discretization (6×6 fixed coordinates). The Critic detects occlusions, overlaps, and layout inconsistencies, recommending edits for precise visual structure. This enforces reproducibility and clarity in instructional videos targeted for knowledge transmission.

The agents operate sequentially but interdependently, forming a pipeline that spans curricular design, programmatic rendering, and aesthetic judgment.
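A minimal sketch of this pipeline under assumed interfaces; the agent objects and their `outline`, `storyboard`, `synthesize`, `render`, and `refine` methods are illustrative names, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Section:
    outline_item: str       # o_i from the Planner's outline O
    storyboard: str = ""    # lecture lines plus animation cues (s_i)
    code: str = ""          # executable Manim code for the section (c_i)

def run_pipeline(query: str, planner, coder, critic, renderer) -> list[Section]:
    """Sequential Planner -> Coder -> Critic pipeline (illustrative sketch)."""
    # Planner: decompose the query into an outline, then expand each item.
    sections = [Section(o) for o in planner.outline(query)]    # O <- P_outline(Q)
    for sec in sections:
        sec.storyboard = planner.storyboard(sec.outline_item)  # o_i -> s_i
    # Coder: per-section code synthesis; sections are independent,
    # so this loop is parallelizable in practice.
    for sec in sections:
        sec.code = coder.synthesize(sec.storyboard)            # c_i = P_coder(s_i, A)
    # Critic: render the code, then refine layout from VLM feedback.
    frames = renderer.render([sec.code for sec in sections])
    critic.refine(frames, sections)
    return sections
```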

2. Educational Content Planning and Asset Integration

Effective educational video generation depends on content structuring and asset retrieval:

  • Outline Generation: The Planner leverages the initial query $\mathcal{Q}$ to construct an outline $\mathcal{O}$ that is context-aware (discipline, audience level) and pedagogically sound (progressive logic, topical relevance).
  • Storyboard Construction: Each outline item $o_i$ is expanded with specific lecture lines and annotated actions (animation, transitions), serving as the blueprint for code synthesis.
  • Asset Management: Integration with an external visual asset database $\mathcal{D}$, managed by the prompt $P_{\text{asset}}$, allows for the automatic selection and caching of images, diagrams, or other illustrative media. This systematic reuse ensures both visual consistency and factual accuracy in educational sequences (a caching sketch follows at the end of this section).

This structured planning phase is pivotal in differentiating Code2Video from generative approaches that lack explicit curricular organization.
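The caching behavior might be implemented along these lines; `AssetCache` and the injected `query_database` callable are hypothetical stand-ins for the prompt $P_{\text{asset}}$ and the database $\mathcal{D}$:

```python
class AssetCache:
    """Cache retrieved visual assets so repeated concepts reuse the same
    image or diagram, keeping sections visually consistent (illustrative)."""

    def __init__(self, query_database):
        self._lookup = query_database      # stands in for P_asset over D
        self._cache: dict[str, str] = {}   # concept -> asset path/URL

    def get(self, concept: str) -> str:
        # The first request queries the external database; later requests
        # reuse the cached selection for curricular consistency.
        if concept not in self._cache:
            self._cache[concept] = self._lookup(concept)   # a_i <- P_asset(s_i)
        return self._cache[concept]

# Usage with a stubbed database lookup:
cache = AssetCache(lambda concept: f"assets/{concept}.png")
assert cache.get("binary_tree") == cache.get("binary_tree")  # same asset reused
```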

3. Programmatic Code Generation and Auto-Fix

The Coder agent translates high-level structured instructions into executable Python code for the Manim engine:

  • Parallelism: By partitioning the storyboard across sections, code generation for each $c_i$ is parallelized, enhancing scalability.
  • ScopeRefine Algorithm: The scope-guided auto-fix mechanism is a hierarchical debugging procedure (sketched below):
    • Local: Isolate and repair at the line level, minimizing computational cost.
    • Block: Expand context if needed.
    • Global: As a last resort, regenerate entire code sections.
  • Efficiency: The iterative, scope-aware repair strategy reduces overall token and runtime overhead, as substantiated by benchmarking.

This systematic code synthesis and repair process ensures robustness, reducing manual intervention and failure rates commonly associated with LLM-based code generation.
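A compact sketch of the ScopeRefine control flow under assumed interfaces: `run` executes the section code and returns an error (line number and message) or `None`; `repair` and `regen` wrap LLM calls; the $K_1$/$K_2$ defaults and the block-location heuristic are placeholders, not the paper's values:

```python
from typing import Callable, NamedTuple, Optional

class RunError(NamedTuple):
    line: int        # 0-based line of the failure
    message: str     # interpreter/renderer error text

def enclosing_block(n_lines: int, line_no: int, radius: int = 8) -> tuple[int, int]:
    # Crude stand-in for locating the enclosing block B_{i,j}: widen the window.
    return max(line_no - radius, 0), min(line_no + radius + 1, n_lines)

def scope_refine(code: str, storyboard: str,
                 run: Callable[[str], Optional[RunError]],
                 repair: Callable[[str, str], str],
                 regen: Callable[[str], str],
                 k1: int = 3, k2: int = 2) -> str:
    """Hierarchical auto-fix: line scope -> block scope -> global regeneration."""
    err = run(code)
    if err is None:
        return code
    lines = code.splitlines()
    # Line scope: up to K1 repairs with only the error line +/- 1 as context.
    for _ in range(k1):
        lo, hi = max(err.line - 1, 0), min(err.line + 2, len(lines))
        lines[lo:hi] = repair("\n".join(lines[lo:hi]), err.message).splitlines()
        err = run("\n".join(lines))
        if err is None:
            return "\n".join(lines)
    # Block scope: up to K2 repairs over the enclosing block B_{i,j}.
    for _ in range(k2):
        lo, hi = enclosing_block(len(lines), err.line)
        lines[lo:hi] = repair("\n".join(lines[lo:hi]), err.message).splitlines()
        err = run("\n".join(lines))
        if err is None:
            return "\n".join(lines)
    # Global scope: regenerate the whole section from its storyboard s_i.
    return regen(storyboard)
```

Escalating the repair scope only on failure is what keeps token usage low: most errors resolve in the cheap line scope, and full regeneration is reserved for the rare unrecoverable case.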

4. Spatial Layout and Aesthetic Refinement

Ensuring pedagogical clarity and visual coherence is delegated to the Critic:

  • Visual Anchor Prompt: The Critic leverages a 2D spatial grid (6×6 anchor coordinates), providing a fixed referential space for element placement and layout refinement (see the sketch at the end of this section).
  • Vision-LLM Evaluation: Using a VLM, the Critic assesses five dimensions—Element Layout, Attractiveness, Logic Flow, Visual Consistency, and Accuracy & Depth—on a standardized 100-point scale per aspect. Detected faults (e.g., overlap, occlusion) trigger corrective actions in spatial arrangement.
  • Iterative Layout Adjustment: Refinement continues until spatial and visual metrics are satisfied. The process ensures each lecture component is optimally presented for knowledge acquisition.

This approach effectively addresses the dual challenge of temporal coherence and spatial clarity in educational content.
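To make the anchor grid concrete, the following sketch computes a 6×6 grid of cell-center coordinates for a 1080p frame; the row-letter/column-number labels are an assumption for illustration, not the paper's naming scheme:

```python
def anchor_grid(width: float = 1920, height: float = 1080, n: int = 6) -> dict:
    """Fixed n x n grid of anchor coordinates (cell centers) that a
    Visual Anchor Prompt can reference by label (illustrative)."""
    cell_w, cell_h = width / n, height / n
    return {
        f"{chr(ord('A') + row)}{col + 1}":                 # labels A1..F6
        ((col + 0.5) * cell_w, (row + 0.5) * cell_h)       # (x, y) cell center
        for row in range(n) for col in range(n)
    }

grid = anchor_grid()
print(grid["A1"], grid["F6"])   # (160.0, 90.0) (1760.0, 990.0)
```

Giving the VLM a small, fixed vocabulary of positions turns vague layout feedback ("the label overlaps the diagram") into actionable edits ("move the label to B2").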

5. Evaluation Metrics and Benchmarking (MMMC and TeachQuiz)

Code2Video is evaluated on the MMMC benchmark, which features professionally crafted, discipline-specific educational videos. Metrics include:

  • VLM-as-a-Judge: Automated scoring by vision-LLMs on the five dimensions listed above; ratings are averaged for an overall quality measure.
  • Code Efficiency: Average code-generation time per topic and token utilization are reported to assess system scalability.
  • TeachQuiz: An end-to-end knowledge reacquisition metric comprising:
    • Unlearning: The student model's prior knowledge of the topic is first suppressed, and the model is then quizzed to establish a baseline score $S_1(K)$.
    • Learning-from-Video: The same model is exposed to the generated video and re-tested, yielding $S_2(K, V_a)$.
    • The metric $TQ(K, V_a) = S_2(K, V_a) - S_1(K)$ captures the differential knowledge gain attributable to viewing the video (computed in the sketch below).

Empirical results show a 40% gain in generation efficiency over direct code generation, with quality comparable to human-crafted tutorials under the VLM-as-a-Judge and TeachQuiz evaluations.
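Operationally, the metric reduces to a differential score; a minimal sketch with hypothetical `student`, `quiz`, and `video` interfaces:

```python
def teach_quiz(student, quiz, knowledge_point: str, video) -> float:
    """TQ(K, V_a) = S_2(K, V_a) - S_1(K): knowledge regained from the video.
    The student, quiz, and video interfaces are illustrative placeholders."""
    student.unlearn(knowledge_point)   # suppress prior knowledge of K
    s1 = quiz.score(student)           # S_1(K): post-unlearning baseline
    student.watch(video)               # learning-from-video phase
    s2 = quiz.score(student)           # S_2(K, V_a): score after viewing
    return s2 - s1                     # differential knowledge gain
```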

6. Implications and Future Research Avenues

The Code2Video paradigm demonstrates that code-centric, agent-based methods yield scalable, interpretable, and pedagogically effective instructional videos. The tri-agent modular architecture provides fine-grained control over content and presentation, addressing shortcomings of pixel-based approaches. A plausible implication is that further scaling of such frameworks could automate large-scale video curriculum production across diverse domains.

Future directions specified include:

  • Extension to broader academic and interdisciplinary topics.
  • Optimization for lightweight, interactive scenarios.
  • Refinement of automated vs. manual aesthetic controls.
  • Integration of human attention modeling for viewer engagement.
  • Enhanced asset filtering and management for abstract topics.

7. Comparative Context and Field Impact

Code2Video diverges from prior video generation models—such as direct text-driven diffusion pipelines or combined transformer/U-Net architectures—by foregrounding executable code and agent collaboration in a modular pipeline. This structured, interpretable approach addresses key educational requirements: content depth, precise structure, and knowledge transfer efficacy. The framework’s demonstrated metrics (AES aesthetic scores and TeachQuiz) on the MMMC benchmark signal a robust foundation and set a performance baseline for future programmatic educational video synthesis (Chen et al., 1 Oct 2025).
