MAGMA-Edu: Self-Reflective Multimodal Edu Framework
- MAGMA-Edu is a self-reflective multi-agent generative framework that integrates text and diagrams for creating high-quality educational questions.
- It employs a two-stage iterative pipeline with generate–validate–reflect loops to refine both textual clarity and diagrammatic accuracy.
- The framework outperforms models like GPT-4o with marked improvements in text quality and image–text consistency across benchmark evaluations.
MAGMA-Edu is a self-reflective multi-agent generative framework designed for automated construction of high-quality multimodal educational questions, with structured integration of textual and diagrammatic components. The system addresses pedagogical demands for precise, semantically aligned educational illustrations, particularly in mathematical contexts where traditional multimodal LLMs have shown substantial limitations in producing cohesive, accurate visual explanations. Unlike previous approaches that treat text and diagram generation as decoupled tasks, MAGMA-Edu unifies these processes through a two-stage, co-evolutionary pipeline employing generate–validate–reflect loops, establishing state-of-the-art performance across a suite of metrics on benchmark datasets (Wu et al., 24 Nov 2025).
1. High-Level System Architecture
MAGMA-Edu decomposes the educational question generation task into two sequential, co-evolving stages, each implemented via a lightweight team of specialized agents operating in bounded generate–validate–reflect cycles.
Stage 1: Textual Generation and Iterative Refinement
- Text Generator: Produces an initial draft based on the instructional intent $\mathcal{I} = (k, s, r)$, where $k$ is the knowledge point, $s$ the student level, and $r$ the diagram request.
- Text Validator: Scores the candidate text on six fine-grained metrics—Understanding Objective (UO), Linguistic Relevance (LR), Question Formulation (QF), Analytical Answer (AA), Conceptual Alignment (CA), and Image-Driven Quality (IDQ).
- Text Reflector: Processes validator feedback and generates a revision directive to guide subsequent drafts.
Stage 2: Programmatic Diagram Synthesis
- Code Generator: Maps the verified text $T^{*}$ together with the diagram request $r$ into executable Matplotlib code.
- Code Executor: Executes the code, yielding the diagram image and execution status.
- Image Validator: Evaluates code syntax, diagrammatic visual quality, and textual–visual semantic alignment.
- Image Reflector: Issues targeted code modifications to correct detected discrepancies.
Each stage iterates until its quality objective surpasses a threshold or a maximum number of iterations is reached.
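To make the Stage 1 cycle concrete, the sketch below shows one plausible shape for the Text Validator's feedback object; the 0–100 scale, the field names, and the simple averaging are assumptions for illustration rather than the paper's exact interface.

```python
from dataclasses import dataclass

# Hypothetical shape of the Text Validator's feedback, assuming each of the six
# fine-grained metrics is scored on a 0-100 scale.
@dataclass
class TextFeedback:
    UO: float   # Understanding Objective
    LR: float   # Linguistic Relevance
    QF: float   # Question Formulation
    AA: float   # Analytical Answer
    CA: float   # Conceptual Alignment
    IDQ: float  # Image-Driven Quality
    comments: str = ""

    @property
    def score(self) -> float:
        # Aggregate quality compared against the textual threshold.
        return (self.UO + self.LR + self.QF + self.AA + self.CA + self.IDQ) / 6
```

Under this reading, a draft is accepted once `feedback.score` clears the textual threshold; otherwise the Text Reflector converts the per-metric feedback into a revision directive appended to the next generation prompt (see the pseudocode in Section 2).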
2. Self-Reflective Mechanisms and Optimization
Each agent ensemble in both textual and visual stages is formulated as an optimizer over task-specific quality scores, structured as:
- Textual Loop Update:
  $$T_{i+1} = \mathrm{TextGenerator}\!\left(\mathcal{I} \,\|\, \Delta_T^{(i)}\right), \qquad \Delta_T^{(i)} = \mathrm{TextReflector}\!\left(T_i,\, f_i\right),$$
  where $f_i = \mathrm{TextValidator}(T_i)$ is the six-metric feedback on draft $T_i$.
- Text Loss and Early Stopping: the textual loop maximizes the aggregate validator score $s_{\mathrm{text}}(T_i)$ over UO, LR, QF, AA, CA, and IDQ. Stopping is triggered when $s_{\mathrm{text}}(T_i) \ge \tau_{\mathrm{text}}$ or the iteration count reaches $I_{\max}$.
- Visual Loop Update:
  $$G_{\mathrm{code}}^{(i+1)} = \mathrm{CodeGenerator}\!\left(T^{*},\, r \,\|\, \Delta_G^{(i)}\right), \qquad \Delta_G^{(i)} = \mathrm{ImageReflector}\!\left(G_{\mathrm{code}}^{(i)},\, v_i\right),$$
  where $v_i = \mathrm{ImageValidator}(G_{\mathrm{code}}^{(i)}, G_{\mathrm{img}}^{(i)}, T^{*})$. The loop terminates when the visual score reaches the threshold $\tau_{\mathrm{visual}}$ or at the iteration cap $I_{\max}$.
Enforcement of Domain-Specific Constraints is realized via:
- Geometric fidelity through explicit coordinate transformations in code generation.
- Semantic alignment via OCR-based label matching, ensuring every quantitative datum in the text is mirrored in the diagram.
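The second constraint can be illustrated with a small OCR-based check. The sketch below assumes pytesseract and Pillow are available; the helper name, the regular expression, and the subset rule are assumptions for illustration, not the paper's implementation.

```python
import re
from PIL import Image
import pytesseract

def quantities_mirrored(question_text: str, diagram_path: str) -> bool:
    """Check that every numeric datum stated in the text also appears in the diagram."""
    # Numbers mentioned in the question text.
    text_numbers = set(re.findall(r"\d+(?:\.\d+)?", question_text))
    # Labels recovered from the rendered diagram via OCR.
    ocr_output = pytesseract.image_to_string(Image.open(diagram_path))
    diagram_numbers = set(re.findall(r"\d+(?:\.\d+)?", ocr_output))
    # Every textual quantity must be visible somewhere in the image.
    return text_numbers <= diagram_numbers
```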
Iterative Pipeline Pseudocode:
```
Algorithm MAGMA-Edu(𝓘, I_max, τ_text, τ_visual):
    # Stage 1: Text Loop
    T ← TextGenerator(𝓘)
    for i in 1..I_max:
        feedback ← TextValidator(T)
        if feedback.score ≥ τ_text: break
        Δ_T ← TextReflector(T, feedback)
        T ← TextGenerator(𝓘 ∥ Δ_T)
    T* ← T

    # Stage 2: Diagram Loop
    G_code ← CodeGenerator(T*, r)
    for i in 1..I_max:
        exec_res, G_img ← CodeExecutor(G_code)
        vres ← ImageValidator(G_code, G_img, T*)
        if vres.visual_score ≥ τ_visual: break
        Δ_G ← ImageReflector(G_code, vres)
        G_code ← CodeGenerator(T*, r ∥ Δ_G)
    G* ← G_img

    return (T*, G*)
```
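The `CodeExecutor` call above can be read as running the generated plotting program in a controlled namespace and reporting its status together with the rendered image. The sketch below is one plausible implementation under that assumption, not code from the paper.

```python
import traceback
import matplotlib
matplotlib.use("Agg")  # headless rendering, so no display is required
import matplotlib.pyplot as plt

def code_executor(code_str: str, out_path: str = "diagram.png"):
    """Run a generated Matplotlib program and return (status, image_path_or_error)."""
    plt.close("all")
    # 'path' feeds the plt.savefig(path) production in the grammar of Section 3.
    namespace = {"plt": plt, "path": out_path}
    try:
        exec(code_str, namespace)               # execute the generated program
        return "ok", out_path
    except Exception:
        return "error", traceback.format_exc()  # feedback for the Image Reflector
```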
3. Code-Based Intermediate Representation
All diagram rendering proceeds through a domain-specific, code-based intermediate representation, maximizing control and interpretability. The code is structured as follows:
- Grammar:
```
<Program>          ::= import matplotlib.pyplot as plt
                       <Stmt>*
                       plt.axis('equal')
                       plt.savefig(path)
<Stmt>             ::= <DrawLine> | <DrawText> | <RightAngleMarker> | ...
<DrawLine>         ::= plt.plot([x1,x2],[y1,y2])
<DrawText>         ::= plt.text(x,y,'label')
```
- Example for Right Triangle Generation:
To draw a right triangle with its legs along the coordinate axes, the Code Generator emits a short program in this grammar, as sketched below.
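This is a minimal illustrative sketch of such a program; the coordinates, the leg lengths of 3 and 4 (matching the worked example in Section 5), the label text, and the output filename are assumptions rather than generated output from the paper.

```python
import matplotlib.pyplot as plt

# Right triangle with the right angle at the origin:
# horizontal leg of length 3, vertical leg of length 4 (illustrative values).
plt.plot([0, 3], [0, 0])       # leg AB along the x-axis
plt.plot([0, 0], [0, 4])       # leg AC along the y-axis
plt.plot([3, 0], [0, 4])       # hypotenuse BC
plt.text(1.5, -0.4, '3 cm')    # label the horizontal leg
plt.text(-0.8, 2.0, '4 cm')    # label the vertical leg
plt.axis('equal')              # preserve geometric proportions
plt.savefig('right_triangle.png')
```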
This enables precise geometric constructions and facilitates semantic verification.
4. Experimental Protocol and Results
Dataset
- 78 knowledge points (granular K–12 mathematics)
- 5 questions per point, totaling 390 multimodal question–diagram pairs.
Evaluation Metrics
- Avg-Text: Arithmetic mean over UO, LR, QF, AA, CA, IDQ per question.
- Image–Text Consistency (ITC): Fraction of questions passing the code-syntax, visual-quality, and semantic-alignment checks.
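For concreteness, the sketch below computes both metrics from hypothetical per-question records; the field names, scales, and example values are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical per-question records: six text scores (0-100) and three diagram checks.
questions = [
    {"UO": 95, "LR": 92, "QF": 90, "AA": 94, "CA": 91, "IDQ": 89,
     "syntax_ok": True, "visual_ok": True, "aligned": True},
    {"UO": 70, "LR": 65, "QF": 72, "AA": 68, "CA": 74, "IDQ": 60,
     "syntax_ok": True, "visual_ok": False, "aligned": True},
]

text_metrics = ("UO", "LR", "QF", "AA", "CA", "IDQ")

# Avg-Text: mean of the six fine-grained text scores, averaged over all questions.
avg_text = mean(mean(q[m] for m in text_metrics) for q in questions)

# ITC: fraction of questions passing all three diagram checks.
itc = mean(q["syntax_ok"] and q["visual_ok"] and q["aligned"] for q in questions)

print(f"Avg-Text = {avg_text:.2f}, ITC = {itc:.2%}")
```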
Results Overview
| Model | Avg-Text (%) | ITC (%) |
|---|---|---|
| GPT-4o | 57.01 | 13.20 |
| MAGMA-Edu | 92.31 | 85.24 |
| MAGMA-Edu (best backbone: Gemini 2.5 Pro) | 96.20 | 99.12 |
- MAGMA-Edu outperforms raw GPT-4o by +35.3 percentage points in Avg-Text, and +72.0 percentage points in ITC.
- For all model backbones, MAGMA-Edu raises image/text consistency above 95% and achieves its best Avg-Text (96.20) with Gemini 2.5 Pro.
- Statistical significance is assessed with a paired bootstrap test (Wu et al., 24 Nov 2025).
This indicates substantial improvements in both textual problem quality and visual–textual alignment compared to leading MLLMs.
5. End-to-End Example: Pythagorean Theorem Problem Generation
Instructional Intent
= "Pythagorean theorem" = "junior-high geometry" = "a right triangle with legs 3 cm and 4 cm; find the hypotenuse"
Stage 1 (Textual Loop)
- Initial generation:
```json
{
  "question_stem": "In right triangle ABC, AB=3 cm, AC=4 cm. Find BC.",
  "image_description": "Right triangle ABC, right angle at A, AB=3 cm, AC=4 cm.",
  "answer": "5 cm",
  "analysis": "By Pythagorean theorem, BC^2=3^2+4^2=25 ⇒ BC=5 cm."
}
```
- Validator passes all metrics; no reflection required.
Stage 2 (Diagrammatic Loop)
- Iteration 1: the Code Generator emits Matplotlib code and the Code Executor renders the diagram, but the Image Validator detects a missing "5 cm" label on the hypotenuse.
- Iteration 2: the Image Reflector issues a targeted correction adding the label; the revised diagram passes all checks and the output is finalized (a sketch of the corrected program follows below).
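The corrected Stage 2 program might look like the sketch below, with the single statement added in Iteration 2 marked; the coordinates, label placement, and output path are illustrative assumptions, not code from the paper.

```python
import matplotlib.pyplot as plt

# Corrected diagram program after Iteration 2 (illustrative coordinates).
plt.plot([0, 3], [0, 0])       # leg AB = 3 cm
plt.plot([0, 0], [0, 4])       # leg AC = 4 cm
plt.plot([3, 0], [0, 4])       # hypotenuse BC
plt.text(1.5, -0.4, '3 cm')    # leg label
plt.text(-0.8, 2.0, '4 cm')    # leg label
plt.text(1.8, 2.3, '5 cm')     # hypotenuse label added after the Iteration-1 finding
plt.axis('equal')
plt.savefig('pythagoras_triangle.png')
```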
Final Output
- Question: “In the right triangle ABC shown, AB = 3 cm and AC = 4 cm. Find the length of the hypotenuse BC.”
- Answer: 5 cm
- Explanation: “By the Pythagorean theorem,… hence BC = 5 cm.”
- Diagram: Correctly labeled right triangle annotating all three side lengths.
6. Significance and Implications
MAGMA-Edu exemplifies a structured approach to generating pedagogically sound multimodal educational content. Its agent-based reflective loops, code-based intermediate representations, and explicit enforcement of geometric and semantic constraints address critical challenges in prior MLLMs, namely poor text–image consistency and diagrammatic inaccuracy. The observed gains relative to baselines such as GPT-4o demonstrate the efficacy of multi-stage self-assessment for content alignment. This suggests that collaborative agent architectures with explicit quality control mechanisms provide a robust foundation for reliable, high-precision educational problem authoring (Wu et al., 24 Nov 2025).