MAGMA-Edu: Self-Reflective Multimodal Edu Framework
- MAGMA-Edu is a self-reflective multi-agent generative framework that integrates text and diagrams for creating high-quality educational questions.
- It employs a two-stage iterative pipeline with generate–validate–reflect loops to refine both textual clarity and diagrammatic accuracy.
- The framework outperforms models like GPT-4o with marked improvements in text quality and image–text consistency across benchmark evaluations.
MAGMA-Edu is a self-reflective multi-agent generative framework designed for automated construction of high-quality multimodal educational questions, with structured integration of textual and diagrammatic components. The system addresses pedagogical demands for precise, semantically aligned educational illustrations, particularly in mathematical contexts where traditional multimodal LLMs have shown substantial limitations in producing cohesive, accurate visual explanations. Unlike previous approaches that treat text and diagram generation as decoupled tasks, MAGMA-Edu unifies these processes through a two-stage, co-evolutionary pipeline employing generate–validate–reflect loops, establishing state-of-the-art performance across a suite of metrics on benchmark datasets (Wu et al., 24 Nov 2025).
1. High-Level System Architecture
MAGMA-Edu decomposes the educational question generation task into two sequential, co-evolving stages, each implemented via a lightweight team of specialized agents operating in bounded generate–validate–reflect cycles.
Stage 1: Textual Generation and Iterative Refinement
- Text Generator: Produces an initial draft based on the instructional intent $\mathcal{I} = (k, s, r)$, where $k$ is the knowledge point, $s$ the student level, and $r$ the diagram request.
- Text Validator: Scores the candidate text on six fine-grained metrics—Understanding Objective (UO), Linguistic Relevance (LR), Question Formulation (QF), Analytical Answer (AA), Conceptual Alignment (CA), and Image-Driven Quality (IDQ).
- Text Reflector: Processes validator feedback and generates a revision directive to guide subsequent drafts.
Stage 2: Programmatic Diagram Synthesis
- Code Generator: Maps the verified text $T^{*}$ together with the diagram request $r$ into executable Matplotlib code.
- Code Executor: Executes the code, yielding the diagram image and execution status.
- Image Validator: Evaluates code syntax, diagrammatic visual quality, and textual–visual semantic alignment.
- Image Reflector: Issues targeted code modifications to correct detected discrepancies.
Each stage iterates until its quality objective surpasses a threshold or a maximum number of iterations is reached.
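To make the Stage 1 cycle concrete, the sketch below shows one plausible shape for the Text Validator's feedback object; the 0–100 scale, the field names, and the simple averaging are assumptions for illustration rather than the paper's exact interface.

```python
from dataclasses import dataclass

# Hypothetical shape of the Text Validator's feedback, assuming each of the six
# fine-grained metrics is scored on a 0-100 scale.
@dataclass
class TextFeedback:
    UO: float   # Understanding Objective
    LR: float   # Linguistic Relevance
    QF: float   # Question Formulation
    AA: float   # Analytical Answer
    CA: float   # Conceptual Alignment
    IDQ: float  # Image-Driven Quality
    comments: str = ""

    @property
    def score(self) -> float:
        # Aggregate quality compared against the textual threshold.
        return (self.UO + self.LR + self.QF + self.AA + self.CA + self.IDQ) / 6
```

Under this reading, a draft is accepted once `feedback.score` clears the textual threshold; otherwise the Text Reflector converts the per-metric feedback into a revision directive appended to the next generation prompt (see the pseudocode in Section 2).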
2. Self-Reflective Mechanisms and Optimization
Each agent ensemble in both textual and visual stages is formulated as an optimizer over task-specific quality scores, structured as:
- Textual Loop Update:
  $$T_{i+1} = \mathrm{TextGenerator}\!\left(\mathcal{I} \,\|\, \Delta_T^{(i)}\right), \qquad \Delta_T^{(i)} = \mathrm{TextReflector}\!\left(T_i,\, f_i\right),$$
  where $f_i = \mathrm{TextValidator}(T_i)$ is the six-metric feedback on draft $T_i$.
- Text Loss and Early Stopping: the textual loop maximizes the aggregate validator score $s_{\mathrm{text}}(T_i)$ over UO, LR, QF, AA, CA, and IDQ. Stopping is triggered when $s_{\mathrm{text}}(T_i) \ge \tau_{\mathrm{text}}$ or the iteration count reaches $I_{\max}$.
- Visual Loop Update:
  $$G_{\mathrm{code}}^{(i+1)} = \mathrm{CodeGenerator}\!\left(T^{*},\, r \,\|\, \Delta_G^{(i)}\right), \qquad \Delta_G^{(i)} = \mathrm{ImageReflector}\!\left(G_{\mathrm{code}}^{(i)},\, v_i\right),$$
  where $v_i = \mathrm{ImageValidator}(G_{\mathrm{code}}^{(i)}, G_{\mathrm{img}}^{(i)}, T^{*})$. The loop terminates when the visual score reaches the threshold $\tau_{\mathrm{visual}}$ or at the iteration cap $I_{\max}$.
Enforcement of Domain-Specific Constraints is realized via:
- Geometric fidelity through explicit coordinate transformations in code generation.
- Semantic alignment via OCR-based label matching, ensuring every quantitative datum in the text is mirrored in the diagram.
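The second constraint can be illustrated with a small OCR-based check. The sketch below assumes pytesseract and Pillow are available; the helper name, the regular expression, and the subset rule are assumptions for illustration, not the paper's implementation.

```python
import re
from PIL import Image
import pytesseract

def quantities_mirrored(question_text: str, diagram_path: str) -> bool:
    """Check that every numeric datum stated in the text also appears in the diagram."""
    # Numbers mentioned in the question text.
    text_numbers = set(re.findall(r"\d+(?:\.\d+)?", question_text))
    # Labels recovered from the rendered diagram via OCR.
    ocr_output = pytesseract.image_to_string(Image.open(diagram_path))
    diagram_numbers = set(re.findall(r"\d+(?:\.\d+)?", ocr_output))
    # Every textual quantity must be visible somewhere in the image.
    return text_numbers <= diagram_numbers
```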
Iterative Pipeline Pseudocode:
```
Algorithm MAGMA-Edu(𝓘, I_max, τ_text, τ_visual):
    # Stage 1: Text Loop
    T ← TextGenerator(𝓘)
    for i in 1..I_max:
        feedback ← TextValidator(T)
        if feedback.score ≥ τ_text: break
        Δ_T ← TextReflector(T, feedback)
        T ← TextGenerator(𝓘 ∥ Δ_T)
    T* ← T

    # Stage 2: Diagram Loop
    G_code ← CodeGenerator(T*, r)
    for i in 1..I_max:
        exec_res, G_img ← CodeExecutor(G_code)
        vres ← ImageValidator(G_code, G_img, T*)
        if vres.visual_score ≥ τ_visual: break
        Δ_G ← ImageReflector(G_code, vres)
        G_code ← CodeGenerator(T*, r ∥ Δ_G)
    G* ← G_img

    return (T*, G*)
```
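The `CodeExecutor` call above can be read as running the generated plotting program in a controlled namespace and reporting its status together with the rendered image. The sketch below is one plausible implementation under that assumption, not code from the paper.

```python
import traceback
import matplotlib
matplotlib.use("Agg")  # headless rendering, so no display is required
import matplotlib.pyplot as plt

def code_executor(code_str: str, out_path: str = "diagram.png"):
    """Run a generated Matplotlib program and return (status, image_path_or_error)."""
    plt.close("all")
    # 'path' feeds the plt.savefig(path) production in the grammar of Section 3.
    namespace = {"plt": plt, "path": out_path}
    try:
        exec(code_str, namespace)               # execute the generated program
        return "ok", out_path
    except Exception:
        return "error", traceback.format_exc()  # feedback for the Image Reflector
```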
3. Code-Based Intermediate Representation
All diagram rendering proceeds through a domain-specific, code-based intermediate representation, maximizing control and interpretability. The code is structured as follows:
- Grammar:
```
<Program>          ::= import matplotlib.pyplot as plt
                       <Stmt>*
                       plt.axis('equal')
                       plt.savefig(path)
<Stmt>             ::= <DrawLine> | <DrawText> | <RightAngleMarker> | ...
<DrawLine>         ::= plt.plot([x1,x2],[y1,y2])
<DrawText>         ::= plt.text(x,y,'label')
```
- Example for Right Triangle Generation:
To draw a right triangle with its legs along the coordinate axes, the Code Generator emits a short program in this grammar, as sketched below.
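This is a minimal illustrative sketch of such a program; the coordinates, the leg lengths of 3 and 4 (matching the worked example in Section 5), the label text, and the output filename are assumptions rather than generated output from the paper.

```python
import matplotlib.pyplot as plt

# Right triangle with the right angle at the origin:
# horizontal leg of length 3, vertical leg of length 4 (illustrative values).
plt.plot([0, 3], [0, 0])       # leg AB along the x-axis
plt.plot([0, 0], [0, 4])       # leg AC along the y-axis
plt.plot([3, 0], [0, 4])       # hypotenuse BC
plt.text(1.5, -0.4, '3 cm')    # label the horizontal leg
plt.text(-0.8, 2.0, '4 cm')    # label the vertical leg
plt.axis('equal')              # preserve geometric proportions
plt.savefig('right_triangle.png')
```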
This enables precise geometric constructions and facilitates semantic verification.
4. Experimental Protocol and Results
Dataset
- 78 knowledge points (granular K–12 mathematics)
- 5 questions per point, totaling 390 multimodal question–diagram pairs.
Evaluation Metrics
- Avg-Text: Arithmetic mean over UO, LR, QF, AA, CA, IDQ per question.
- Image–Text Consistency (ITC): Fraction of questions passing the code-syntax, visual-quality, and semantic-alignment checks.
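For concreteness, the sketch below computes both metrics from hypothetical per-question records; the field names, scales, and example values are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical per-question records: six text scores (0-100) and three diagram checks.
questions = [
    {"UO": 95, "LR": 92, "QF": 90, "AA": 94, "CA": 91, "IDQ": 89,
     "syntax_ok": True, "visual_ok": True, "aligned": True},
    {"UO": 70, "LR": 65, "QF": 72, "AA": 68, "CA": 74, "IDQ": 60,
     "syntax_ok": True, "visual_ok": False, "aligned": True},
]

text_metrics = ("UO", "LR", "QF", "AA", "CA", "IDQ")

# Avg-Text: mean of the six fine-grained text scores, averaged over all questions.
avg_text = mean(mean(q[m] for m in text_metrics) for q in questions)

# ITC: fraction of questions passing all three diagram checks.
itc = mean(q["syntax_ok"] and q["visual_ok"] and q["aligned"] for q in questions)

print(f"Avg-Text = {avg_text:.2f}, ITC = {itc:.2%}")
```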
Results Overview
| Model | Avg-Text (%) | ITC (%) |
|---|---|---|
| GPT-4o | 57.01 | 13.20 |
| MAGMA-Edu | 92.31 | 85.24 |
| MAGMA-Edu (best backbone: Gemini 2.5 Pro) | 96.20 | 99.12 |
- MAGMA-Edu outperforms raw GPT-4o by +35.3 percentage points in Avg-Text, and +72.0 percentage points in ITC.
- For all model backbones, MAGMA-Edu raises image/text consistency above 95% and achieves its best Avg-Text (96.20) with Gemini 2.5 Pro.
- Statistical significance is assessed with a paired bootstrap test (Wu et al., 24 Nov 2025).
This indicates substantial improvements in both textual problem quality and visual–textual alignment compared to leading MLLMs.
5. End-to-End Example: Pythagorean Theorem Problem Generation
Instructional Intent
= "Pythagorean theorem" = "junior-high geometry" = "a right triangle with legs 3 cm and 4 cm; find the hypotenuse"
Stage 1 (Textual Loop)
- Initial generation:
```json
{
  "question_stem": "In right triangle ABC, AB=3 cm, AC=4 cm. Find BC.",
  "image_description": "Right triangle ABC, right angle at A, AB=3 cm, AC=4 cm.",
  "answer": "5 cm",
  "analysis": "By Pythagorean theorem, BC^2=3^2+4^2=25 ⇒ BC=5 cm."
}
```
- Validator passes all metrics; no reflection required.
Stage 2 (Diagrammatic Loop)
- Iteration 1: the Code Generator emits Matplotlib code and the Code Executor renders the diagram, but the Image Validator detects a missing "5 cm" label on the hypotenuse.
- Iteration 2: the Image Reflector issues a targeted correction adding the label; the revised diagram passes all checks and the output is finalized (a sketch of the corrected program follows below).
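The corrected Stage 2 program might look like the sketch below, with the single statement added in Iteration 2 marked; the coordinates, label placement, and output path are illustrative assumptions, not code from the paper.

```python
import matplotlib.pyplot as plt

# Corrected diagram program after Iteration 2 (illustrative coordinates).
plt.plot([0, 3], [0, 0])       # leg AB = 3 cm
plt.plot([0, 0], [0, 4])       # leg AC = 4 cm
plt.plot([3, 0], [0, 4])       # hypotenuse BC
plt.text(1.5, -0.4, '3 cm')    # leg label
plt.text(-0.8, 2.0, '4 cm')    # leg label
plt.text(1.8, 2.3, '5 cm')     # hypotenuse label added after the Iteration-1 finding
plt.axis('equal')
plt.savefig('pythagoras_triangle.png')
```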
Final Output
- Question: “In the right triangle ABC shown, AB = 3 cm and AC = 4 cm. Find the length of the hypotenuse BC.”
- Answer: 5 cm
- Explanation: “By the Pythagorean theorem,… hence BC = 5 cm.”
- Diagram: Correctly labeled right triangle annotating all three side lengths.
6. Significance and Implications
MAGMA-Edu exemplifies a structured approach to generating pedagogically sound multimodal educational content. Its agent-based reflective loops, code-based intermediate representations, and explicit enforcement of geometric and semantic constraints address critical challenges in prior MLLMs, namely poor text–image consistency and diagrammatic inaccuracy. The observed gains relative to baselines such as GPT-4o demonstrate the efficacy of multi-stage self-assessment for content alignment. This suggests that collaborative agent architectures with explicit quality control mechanisms provide a robust foundation for reliable, high-precision educational problem authoring (Wu et al., 24 Nov 2025).