EduVidQA: Multimodal Educational QA Benchmark
- EduVidQA is a multimodal benchmark dataset of 5,252 QA pairs, spanning both real-world and synthetic questions grounded in computer science lecture videos with precise temporal context.
- It employs rigorous methods like regex-based mining, expert verification, and adversarial filtering to ensure high-quality, contextually grounded data.
- Benchmarking experiments across open-source and proprietary models reveal robust gains from fine-tuning on the synthetic data, enabling context-rich educational answer generation.
The EduVidQA dataset is a benchmark corpus and evaluation framework designed to support the development and assessment of Multimodal LLMs (MLLMs) for answering student questions grounded in lecture video content. Focused on online computer science and artificial intelligence lectures, EduVidQA provides both real-world and synthetic question–answer (QA) pairs with rigorous temporal and pedagogical grounding, facilitating research on long-form educational answer generation and its evaluation against real educational artifacts.
1. Dataset Construction and Composition
The EduVidQA dataset consists of 5,252 question–answer pairs collected from 296 publicly available computer science lecture videos. It is drawn from two principal sources:
- Real-world QA Pairs (270): These pairs originate from YouTube lecture videos. Candidate questions are first mined using regular expressions (e.g., detecting the presence of question marks), as sketched after this list. To ensure knowledge-seeking intent and high temporal fidelity, each question is then manually reviewed and filtered for temporal grounding (i.e., explicit or implicit reference to specific video segments) and pedagogical relevance. Exemplar answers are written and subsequently verified or revised by domain experts, without the involvement of LLMs, to avoid model-induced bias.
- Synthetic QA Pairs (4,982): To mitigate annotation bottlenecks and ensure scale, an extensive synthetic QA subset is generated using manually curated transcripts from NPTEL computer science courses and the GPT-4o model. Automated QA generation produces initial candidates, which are then filtered in multiple stages:
- Removal of questions lacking timestamp grounding.
- An adversarial refinement step: questions are retained only if their answers require video context, operationalized via an entailment score threshold of 0.65 (i.e., questions that can be answered without the video context are discarded); a minimal sketch of this filter appears at the end of this subsection.
- Each QA pair receives automatic annotation for difficulty using Bloom’s taxonomy, grouped into "easy", "medium", and "hard" categories.
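To illustrate the mining step for the real-world pairs, the following Python snippet shows how a simple regex heuristic could surface candidate questions from free-form viewer text. It is a minimal sketch under assumptions: the pattern, function name, and example comment are illustrative and not the dataset's actual mining expression.

```python
# Illustrative regex-based mining of candidate questions from viewer text.
# The pattern is a simple heuristic (sentence ending in "?" that opens with an
# interrogative word), not the exact expression used to build the dataset.
import re

QUESTION_RE = re.compile(
    r"(?:^|[.!?]\s+)"                # start of text or of a new sentence
    r"((?:how|what|why|when|where|which|who|can|does|is|are)\b[^.!?]*\?)",
    re.IGNORECASE,
)

def mine_candidate_questions(text: str) -> list[str]:
    """Return sentences that look like knowledge-seeking questions."""
    return [m.group(1).strip() for m in QUESTION_RE.finditer(text)]

comment = ("Great lecture. Around 12:30 you mention backpropagation. "
           "Why does the gradient vanish for deep sigmoid networks?")
print(mine_candidate_questions(comment))
```

Candidates surfaced this way would still undergo the manual review for knowledge-seeking intent and temporal grounding described above.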
The overall dataset thus captures a diverse spectrum of both naturally occurring and synthetic QA, explicitly accounting for variation in cognitive demand, difficulty, and student interests.
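The adversarial context-entailment filter can likewise be approximated with an off-the-shelf natural language inference (NLI) model: if an answer is strongly entailed by the question alone, the pair is deemed answerable without the video and is dropped. The sketch below is illustrative; the NLI checkpoint, premise/hypothesis pairing, and helper names are assumptions, while the 0.65 threshold and the retain-if-context-required rule come from the dataset description.

```python
# Illustrative sketch of the adversarial context-entailment filter for
# synthetic QA pairs. The NLI checkpoint and premise/hypothesis pairing are
# assumptions; only the 0.65 threshold comes from the dataset description.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")  # hypothetical choice

CONTEXT_FREE_THRESHOLD = 0.65  # above this, the pair is judged answerable without the video

def requires_video_context(question: str, answer: str) -> bool:
    """True if the answer is not strongly entailed by the question alone."""
    scores = nli({"text": question, "text_pair": answer}, top_k=None)
    entailment = next(s["score"] for s in scores if s["label"].upper().startswith("ENTAIL"))
    return entailment < CONTEXT_FREE_THRESHOLD

def filter_synthetic_pairs(pairs):
    """Keep only pairs that carry a timestamp and genuinely need lecture context."""
    return [
        p for p in pairs
        if p.get("timestamp") is not None
        and requires_video_context(p["question"], p["answer"])
    ]
```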
2. Data Processing Methodology
The methodology for curation and quality control involves:
- Mining and Filtering: Initial collection of candidate questions using regex patterns; manual filtering for knowledge-seeking intent.
- Verification: Expert review of answers to ensure accuracy and completeness, especially for the real-world data.
- Synthetic Data Generation and Filtering: GPT-4o is used to generate QA pairs from transcripts, with downstream filtering based on presence of timestamps and context entailment.
- Metadata Annotation: Automated assignment of Bloom’s taxonomy-based cognitive difficulty tags; these are subsequently grouped and partially validated manually.
- Temporal Context Extraction: To maintain pedagogical relevance, each QA pair is associated with a context window (~4 minutes around the referenced timestamp) comprising both transcript text and visual frames sampled from the video; a minimal extraction sketch appears below.
This pipeline collectively yields a high-quality, temporally grounded, and pedagogically diverse dataset designed to facilitate realistic MLLM question answering.
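As a concrete illustration of the temporal context extraction step, the snippet below assembles an approximately four-minute window of transcript text and uniformly sampled frames around a referenced timestamp. It is a minimal sketch: the window symmetry, frame count, and data layout (segments with start/end/text fields) are assumptions rather than the dataset's exact preprocessing.

```python
# Minimal sketch: build a ~4-minute multimodal context window around a timestamp.
# Assumptions: transcript segments carry "start", "end", and "text" fields;
# frames are sampled uniformly with OpenCV.
import cv2

WINDOW_SEC = 240  # ~4 minutes total, centred on the referenced timestamp

def transcript_window(segments, timestamp):
    """Return transcript text overlapping [timestamp - 2 min, timestamp + 2 min]."""
    lo, hi = timestamp - WINDOW_SEC / 2, timestamp + WINDOW_SEC / 2
    return " ".join(s["text"] for s in segments if s["end"] >= lo and s["start"] <= hi)

def sample_frames(video_path, timestamp, num_frames=8):
    """Uniformly sample frames from the same window of the lecture video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in range(num_frames):
        t = timestamp - WINDOW_SEC / 2 + i * WINDOW_SEC / (num_frames - 1)
        cap.set(cv2.CAP_PROP_POS_MSEC, max(t, 0) * 1000)  # seek to the sample time
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```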
3. Model Benchmarking and Evaluation Protocols
EduVidQA is used to benchmark six state-of-the-art MLLMs, including both open-source and proprietary models:
- Open-source video LLMs: mPLUG-Owl3-8B, Video-LLaVA-7B
- Large Vision-LLMs (LVLMs): Qwen2-VL-7B, LLaVA-13B
- Proprietary foundation models: GPT-4o, Gemini 1.5
Experiments are conducted on both the synthetic and real-world test sets. Fine-tuning is performed via supervised LoRA on the synthetic data, and all models are evaluated on their ability to generate context-rich, long-form answers. A key result is that models fine-tuned on the synthetic data show robust improvements across all reported text-based metrics; in some cases, smaller open-source models approach or even outperform substantially larger proprietary models.
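For reference, a supervised LoRA setup of the kind described above can be configured with the Hugging Face PEFT library. The following sketch uses illustrative hyperparameters, target modules, and a placeholder base checkpoint; it is not the authors' training recipe.

```python
# Illustrative LoRA configuration for supervised fine-tuning on the synthetic
# QA split. The base checkpoint, target modules, and hyperparameters are
# placeholders, not the values used in the EduVidQA experiments.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "path/to/open-source-video-llm"  # placeholder; substitute the chosen checkpoint
model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the small adapter matrices are trained
```

The practical appeal of LoRA here is that only the low-rank adapter matrices are updated, which keeps fine-tuning on the synthetic split feasible for 7B–13B open-source models.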
Despite these advances, the task remains nontrivial due to the need for temporal grounding, spatio-temporal reasoning (even with multimodal input), and alignment with instructional intent rather than surface-level correctness.
4. Evaluation Metrics: Text-based and Qualitative
EduVidQA employs a two-pronged evaluation approach:
- Text-based Metrics: These include n-gram-based metrics (BLEU, ROUGE-L, METEOR), a semantic entailment score, and LLM-based FactQA Precision/Recall (a minimal computation sketch for the n-gram metrics follows this list). Such metrics capture factual and surface-level correspondence but are insufficient for assessing deeper educational quality.
- Qualitative Metrics (Student Preference Grounded): To reflect the true educational utility of model-generated answers, the dataset incorporates human and model preference studies that focus on three axes:
- Clarity: The absence of jargon and ambiguity, with a logical structure.
- Encouraging Critical Thinking (ECT): The presence of prompts for further inquiry, open-ended questions, or alternative perspectives.
- Using Pedagogical Techniques (UPT): The deployment of analogies, exemplification, and stepwise explanations.
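The n-gram metrics listed under Text-based Metrics can be computed with standard libraries. The sketch below uses NLTK and the rouge-score package; the library choices, whitespace tokenization, and toy example are assumptions, not the benchmark's official scoring script.

```python
# Illustrative computation of BLEU, ROUGE-L, and METEOR for one generated
# answer against a reference answer.
# Requires: nltk (plus nltk.download("wordnet") for METEOR) and rouge-score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

def text_metrics(reference: str, prediction: str) -> dict:
    ref_tok, pred_tok = reference.split(), prediction.split()
    bleu = sentence_bleu([ref_tok], pred_tok,
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, prediction)["rougeL"].fmeasure
    meteor = meteor_score([ref_tok], pred_tok)
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "METEOR": meteor}

print(text_metrics(
    "Backpropagation computes gradients layer by layer using the chain rule.",
    "Backpropagation applies the chain rule to compute gradients for each layer.",
))
```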
Qualitative evaluation relies on human annotation with Likert-scale scoring, cross-validated against GPT-4o-based assessments; alignment between human and model judgments is reported via Spearman's correlation and Cohen's kappa.
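A minimal sketch of this human–LLM alignment check on Likert ratings is given below; the example ratings are hypothetical placeholders, and the use of a quadratic-weighted kappa is an assumption rather than the paper's reported configuration.

```python
# Minimal sketch of the human-vs-LLM alignment check on Likert-scale ratings.
# The rating vectors are hypothetical placeholders for illustration only.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human_likert = [5, 4, 4, 3, 5, 2, 4, 3]   # hypothetical human clarity ratings (1-5)
gpt4o_likert = [5, 4, 3, 3, 5, 2, 4, 4]   # hypothetical GPT-4o ratings of the same answers

rho, p_value = spearmanr(human_likert, gpt4o_likert)
# A weighted kappa is a common choice for ordinal Likert data; the exact
# weighting scheme used in the paper is not assumed here.
kappa = cohen_kappa_score(human_likert, gpt4o_likert, weights="quadratic")

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), Cohen's kappa = {kappa:.2f}")
```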
5. Insights from Student Preference Studies
An empirical study involving ten university students (both undergraduate and graduate) probes preferences over alternative generated answers. Results indicate that:
- Clarity is the most valued property (preferred in more than 60% of judgments), especially for medium-difficulty questions.
- Prompts for critical thinking and pedagogical techniques are also valued, though their prominence varies with question difficulty; preferences further differ by learner level, with undergraduates leaning towards comprehensive, detailed explanations while graduate students often prefer conciseness.
- Student preference feedback is used to inform both the answer-editing process during synthetic data generation and the construction of new qualitative evaluation criteria.
A plausible implication is that automated educational QA systems should calibrate explanation depth and clarity based on question difficulty and target learner profile to maximize engagement and instructional effectiveness.
6. Applications and Research Value
EduVidQA provides a test bed for evaluating the integration of multimodal inputs (video frames, transcript, timestamp context) in educational QA, advancing research in MLLMs tuned for real-world lecture data. Documented model benchmarking and rigorous evaluation illuminate both the strengths (e.g., gains with synthetic fine-tuning) and ongoing challenges (e.g., visual/textual fusion, generalization across question difficulty).
The dataset also enables:
- Finer-grained analysis of where models fail (e.g., on temporal or visual dependencies).
- Development of new metrics to close the gap between surface-level text similarity and qualitative educational utility.
- Comparative evaluation between real-world and synthetic questions for transfer learning and domain adaptation studies.
7. Directions for Future Work
The authors highlight several future research directions:
- Metric Refinement: Bridging the gap between automatic and human-grounded qualitative metrics.
- Domain Extension: Applying the schema beyond computer science to other educational domains.
- Multimodal Reasoning Improvements: Deeper fusion of audio, visual, and textual modalities to improve context sensitivity.
- Reverse QA: Using answer generation as a probe for question quality.
- Customization: Tailored adaptation of models to accommodate learner-level and difficulty-specific requirements for pedagogical alignment.
This suggests EduVidQA will serve as a foundation for robust research in educational NLP, enabling the systematic development and assessment of multimodal educational assistants that respond to diverse and cognitively demanding student questions based on real lecture video content.