ScratchEval: Evaluation Framework for Scratch
- ScratchEval is a comprehensive suite for assessing block-based Scratch programs, combining visual reasoning tests, dynamic execution-based testing, and automated bug-repair evaluation.
- It leverages multimodal evaluation protocols that jointly test visual parsing, logical reasoning, and control-flow understanding within Scratch environments.
- Its automated testing frameworks support dynamic assessment and educational feedback via block-based test scripts and search-based test generation techniques.
ScratchEval denotes a set of closely related benchmarks, evaluation methodologies, and functional testing frameworks targeting the understanding, analysis, repair, and grading of block-based visual programs—predominantly in the Scratch environment. It encapsulates both multimodal LLM evaluation protocols using visual programming challenges and executable test architectures for dynamic assessment and automated feedback within educational contexts. The term refers to several distinct but interrelated artifacts, including a visual reasoning benchmark for LMMs (Fu et al., 2024), an executable LLM repair benchmark (Si et al., 31 Jan 2026), and reference test generation systems for block-based code (Feldmeier et al., 2024, Deiner et al., 2022, Wang et al., 2021).
1. Motivation and Conceptual Foundations
Block-based languages such as Scratch are designed to remove textual syntactic burdens, enabling novices to acquire core programming concepts through direct manipulation and immediate feedback. Traditional programming benchmarks and evaluation protocols (e.g., image-to-code benchmarks) fail to capture the complex joint reasoning required to analyze integrated visual-program logic and program intent as found in Scratch. ScratchEval was proposed to bridge this evaluation gap by introducing benchmarks and tools that comprehensively test the visual, logical, and behavioral understanding of Scratch projects, allowing both researchers and practitioners to assess model or human performance in domains requiring unified perception and program reasoning (Fu et al., 2024).
Unlike text-based code evaluation, which is amenable to sequence models, block-based programs exhibit non-linear, event-driven, and highly concurrent execution patterns, often coupled tightly with multimedia assets and graphical state. Automated evaluation in this paradigm necessitates frameworks capable of dynamic execution, block-level analysis, and multimodal input/output comparison (Si et al., 31 Jan 2026). The corresponding testing architectures are further tasked with supporting automated grading, batch feedback dissemination, and minimal instructor overhead (Feldmeier et al., 2024).
2. Multimodal Visual Reasoning Benchmarks
The ScratchEval benchmark (Fu et al., 2024) measures the ability of large multimodal models (LMMs) to reason about visual block-based programs as presented through screenshots, accompanied by behavioral queries. Its design targets joint visual parsing and logical reasoning:
- Dataset Construction: 305 manually curated multiple-choice questions, each comprising a screenshot of a real Scratch script, a text question probing behavioral intent/result, and four answer options (A/B/C/D). Dual-language (English/Chinese) versions enable cross-linguistic robustness. Questions span four cognitive categories: Mathematics (n=133), Logical thinking (99), Graphic perception (59), and Spatial perception (43).
- Evaluation Protocol: The system prompt, script image, and question are given to the model; free-form LMM answers are mapped to one of the answer options (A–D), and overall and per-category accuracy are computed as the primary metrics (a minimal scoring sketch follows at the end of this section). Models are compared under different prompting regimes: no CoT, zero-shot CoT, and explicit CoT ("explain step by step").
- Model Coverage: Closed models (Gemini-1.5-Pro, GPT-4o, GPT-4-Turbo, Claude-3.5-Sonnet) and open models (Qwen2-VL, LLaVA-v1.6, InternVL2, Pixtral, MiniCPM-v2.6, Molmo) were evaluated.
- Empirical Findings: No model exceeded 53% total accuracy, and all performed below human-aligned baselines (>80%), with particular weakness on mathematics and logical control-flow questions (often <45%). CoT-enhanced prompts boost accuracy by 10–20%, though the explicit "explain step by step" variant yields marginal or negative returns due to verbosity.
- Error Analysis: Models misread block symbols, confuse subtle control-flow ordering, and hallucinate block meanings. Larger vision-LLMs trained on richer multimodal data outperform smaller counterparts, but hallucination and symbol confusion remain the principal failure modes.
This protocol exposes limitations in current LMMs for joint perception–reasoning over visual code and highlights the distinction between textual and visual program understanding benchmarks (Fu et al., 2024).
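As a rough illustration of this scoring protocol (not the benchmark's released harness), the following Python sketch maps free-form answers to option letters and computes overall and per-category accuracy. The `query_model` function and the item fields (`image`, `question`, `options`, `answer`, `category`) are illustrative assumptions.

```python
import re
from collections import defaultdict

def extract_choice(response):
    """Map a free-form model answer to one of the option letters A-D (None if absent)."""
    match = re.search(r"\b([ABCD])\b", response.strip().upper())
    return match.group(1) if match else None

def evaluate(items, query_model):
    """Compute overall and per-category accuracy for multiple-choice items.

    `items` is a list of dicts with 'image', 'question', 'options' (four strings),
    'answer' (gold letter), and 'category' keys; `query_model` sends the screenshot
    plus prompt to an LMM and returns its free-form reply. Both are illustrative
    assumptions, not the benchmark's released interface.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", item["options"])
        )
        reply = query_model(image=item["image"], prompt=prompt)
        predicted = extract_choice(reply)
        total[item["category"]] += 1
        correct[item["category"]] += int(predicted == item["answer"])
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category
```

In practice, answer extraction must be more robust to verbose CoT outputs, for example by taking the last option letter mentioned or constraining the answer format in the prompt.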
3. Executable Program Repair and Debugging Evaluation
A complementary strand of ScratchEval targets executable evaluation of LLMs on program repair, debugging, and analysis for authentic, complex Scratch projects (Si et al., 31 Jan 2026). This paradigm moves beyond static question-answering to dynamic, VM-mediated testing:
- Benchmark Composition: 100 semantically complex, multi-sprite Scratch projects, each hand-vetted for diversity (games, animations, stories) and structural features (≥5 sprites, ≥15 scripts, ≥3 broadcasts, ≥1 custom block).
- Bug Synthesis: Eight semantically grounded bug patterns are systematically injected via Bug Forge, ensuring that each injected edit introduces a concrete, test-detectable functional failure without redundant or trivial changes.
- Test Suite Generation: For each project, interaction scenarios invoke possible events (green flag, keys, broadcasts), traced at deterministic checkpoints. Differential oracles are synthesized to distinguish gold (correct) and buggy executions.
- Evaluation Protocol:
  1. Functional Correctness: A proposed repair patch is applied, the project is re-executed under the synthesized scenarios, and the repair is scored as functionally correct only if all behavioral oracles pass.
  2. Edit Distance: The symmetric difference between block-level edit sets quantifies repair minimality; semantic drift is measured via normalized state/behavioral trace discrepancies (a minimal metric sketch follows at the end of this section).
  3. Explanation Rubric: Free-form explanations are scored by LLM judges on a 1–5 rubric for trigger–mechanism–outcome (TMO) alignment.
- Experimental Outcomes: Gemini 3 achieved 41% repair success, ChatGPT 32%, Qwen 0-shot 23% (LoRA-tuned Qwen 26%). LoRA tuning primarily reduced edit distance and drift. Explanation accuracy similarly favored Gemini and ChatGPT (G-Acc up to 81%).
These findings underscore the challenge of joint semantic, structural, and multimodal reasoning for LLMs, even with domain-adapted fine-tuning. The methodology provides a reproducible foundation for rigorous LLM assessment on Scratch and comparable environments.
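To make the edit-distance and drift criteria concrete, here is a minimal Python sketch under assumed representations: edits as hashable tuples, execution traces as lists of checkpoint snapshots, and oracles as (scenario, check) pairs. None of these reflect the benchmark's actual internal formats.

```python
def edit_distance(gold_edits, patch_edits):
    """Symmetric difference between the gold edit set (reversal of the injected bug)
    and the model's proposed edit set; each edit is encoded as a hashable tuple such
    as (sprite, script_id, block_index, opcode) -- an illustrative encoding only."""
    return len(set(gold_edits) ^ set(patch_edits))

def semantic_drift(gold_trace, patched_trace):
    """Normalized discrepancy between state snapshots recorded at deterministic
    checkpoints of the gold and patched executions (0.0 means identical behavior)."""
    length = max(len(gold_trace), len(patched_trace), 1)
    mismatches = sum(g != p for g, p in zip(gold_trace, patched_trace))
    mismatches += abs(len(gold_trace) - len(patched_trace))
    return mismatches / length

def repair_passes(oracle_suite, patched_project, run_scenario):
    """Functional correctness: the repair counts only if every behavioral oracle
    accepts the patched project's execution under its interaction scenario.
    `run_scenario` and the (scenario, check) pairs are assumed interfaces."""
    return all(check(run_scenario(patched_project, scenario))
               for scenario, check in oracle_suite)
```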
4. Automated Testing Frameworks: Block-Based and Search-Based
ScratchEval encompasses automated testing architectures designed for high-coverage assessment and batch feedback, leveraging both block-level DSLs and search-based test generation:
- Block-Based Test Blocks: An extensible category of test primitives (control, trigger, assertion, reporter) is integrated into the Scratch environment (Feldmeier et al., 2024). Tests are constructed as block scripts, supporting event injection (e.g., trigger_key_pressed), behavioral assertions (e.g., assert_gt(sprite_x(…))), and VM state restoration (test_init, test_reset). The framework extends the UI with a test panel, examples gallery, and batch runner supporting multi-project grading.
- Quantitative Evaluation (Teacher-Created Tests): In an empirical study, 28 teachers authored tests for a canonical game and assessed 21 student solutions. The median accuracy of teacher-authored tests with respect to a gold suite was 0.93, confirming the viability of block-based dynamic assessment for educational use.
- Search-Based Test Generation: Advanced frameworks (e.g., those deployed in Whisker and ported for ScratchEval) employ many-objective search (MOSA, MIO) to automatically generate coverage-maximizing event sequences (Deiner et al., 2022). The system ensures determinism by seeding randomness and translating delays and animations into step-based analogues, yielding wall-clock savings and non-flaky results. MOSA and MIO achieved 69% coverage on the top-1000 Scratch projects, with substantial speed-ups and smaller test suites compared to random generation; a simplified search loop is sketched at the end of this section.
- SnapCheck-Inspired Dynamic Testing: Condition-action templates capture WHEN–THEN logic, enabling rich, property-based assessment (Wang et al., 2021). These templates are formalized as tuples of trigger predicate, delay, assertion predicate, and scope, supporting spatiotemporal assertions and event coverage; a sketch of such a template follows below. Porting required extensions for Scratch broadcast/clone predicates.
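The condition-action idea can be approximated in a few lines; the following is a hedged Python rendering with hypothetical state-snapshot keys and sprite names, not SnapCheck's actual implementation or syntax.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # one snapshot of sprite/stage variables at a VM step (illustrative)

@dataclass
class WhenThenTemplate:
    """Condition-action test: WHEN `trigger` holds, wait `delay` steps,
    THEN `assertion` must hold, within the declared `scope` (e.g. one sprite)."""
    trigger: Callable[[State], bool]
    delay: int
    assertion: Callable[[State], bool]
    scope: str

def check_template(template, trace):
    """Evaluate the template against a recorded execution trace (list of State).
    Fails only if the trigger fired and the delayed assertion did not hold."""
    for step, state in enumerate(trace):
        if template.trigger(state):
            target = step + template.delay
            if target < len(trace) and not template.assertion(trace[target]):
                return False
    return True

# Illustrative usage: "when the space key is pressed, the Cat sprite must have
# moved past x = 100 within 5 steps" (key and sprite names are hypothetical).
move_right = WhenThenTemplate(
    trigger=lambda s: s.get("key_space", False),
    delay=5,
    assertion=lambda s: s.get("Cat_x", 0) > 100,
    scope="Cat",
)
```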
This intersection of user-facing test blocks and backend search-based techniques forms the technical substrate for broad-spectrum Scratch program assessment.
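The full MOSA/MIO machinery is beyond a short example, but the core loop of seeded, coverage-guided event-sequence generation can be sketched as follows. The single-objective fitness here is plain block coverage rather than per-goal many-objective ranking, and the event names, `run_events`, and `covered_blocks` are assumed interfaces for illustration, not the Whisker API.

```python
import random

# Hypothetical event vocabulary for sequence generation.
EVENTS = ["green_flag", "key_space", "key_left", "key_right", "broadcast_start"]

def generate_suite(run_events, covered_blocks, total_blocks,
                   seed=0, budget=200, max_len=20):
    """Seeded, coverage-guided search for event sequences.

    `run_events(seq)` executes a sequence on a deterministic (seeded, step-based)
    VM and `covered_blocks(result)` returns the ids of executed blocks; both are
    assumed interfaces for illustration.
    """
    rng = random.Random(seed)            # fixed seed => reproducible, non-flaky runs
    suite, covered = [], set()
    best = [rng.choice(EVENTS) for _ in range(max_len)]
    for _ in range(budget):
        candidate = list(best)
        candidate[rng.randrange(len(candidate))] = rng.choice(EVENTS)  # mutate one event
        new_cov = covered_blocks(run_events(candidate))
        if new_cov - covered:            # keep candidates that reach new blocks
            covered |= new_cov
            suite.append(candidate)
            best = candidate
        if len(covered) >= total_blocks:
            break
    return suite, len(covered) / max(total_blocks, 1)
```

A real implementation would additionally maintain a per-goal archive of best tests (as MOSA and MIO do) and minimize the resulting suite before reporting coverage.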
5. Applications in LLM- and Human-Centric Evaluation
ScratchEval methodologies enable both LLM-centered and human-centered analysis:
- LLM-Centric Benchmarks: These directly evaluate the capacity of LMMs to parse, explain, and predict behavior from visual code representations, and they inform model development by highlighting current limitations in block-level visual encoding and logical reasoning.
- Repair and Debugging Assistance: Provides executable and minimal-edit ground truth for repair tasks, enabling controlled evaluation of model-generated fixes at semantic and behavioral levels.
- Educational Assessment and Automated Grading: Empirical results support teacher-constructed dynamic tests and batch grading pipelines as accurate and scalable for formative and summative assessment (Feldmeier et al., 2024). Feedback workflows integrate directly with the Scratch UI, and the block-based model supports remixing/templatization for pedagogical reuse.
- Design Guideline Extraction: Analysis of LLM responses in family creative coding contexts informed guidelines for promoting learner agency, multimodal debugging, live execution feedback, and individualized instructional pathways (Druga et al., 2023).
These application domains collectively advance the state of the art in both automated and human-in-the-loop assessment of block-based programming proficiency.
6. Limitations and Future Research Directions
Key limitations of current ScratchEval frameworks include:
- Restriction to multiple-choice or scripted-output tasks in LMM evaluation, limiting the expression and assessment of open-ended or interactive programming understanding (Fu et al., 2024).
- Single-bug focus in repair benchmarks, with synthesized test suites that may miss untested regressions or bug interactions (Si et al., 31 Jan 2026). The coverage of bug types is bounded by the injected pattern catalog.
- Usability challenges in manual authoring of block-based tests; discovery and sequencing of tests may present obstacles to teachers/users unfamiliar with the assertion palette (Feldmeier et al., 2024).
- Model evaluation may be confounded by pretraining exposure to benchmark data, though human-in-the-loop curation mitigates duplication.
- Practical LLMs exhibit persistent hallucinations, especially regarding block semantics, control dependencies, and event-driven concurrency.
Open research questions include: the design of vision–graphical structure encoders for block arrangement parsing; self-debugging/execution-based verification within LMM reasoning loops; optimal multimodal pretraining datasets for block-based logic; transfer pathways from block-based to textual code reasoning benchmarks; and collaborative human–AI grading and tool-building platforms (Fu et al., 2024, Si et al., 31 Jan 2026).
Proposed extensions include integration of open-ended natural language responses, expanded multi-bug and multi-sprite scenarios, live simulation/interaction APIs, and richer debugging/validation tools.
7. Summary Table: ScratchEval Facets
| Area | Key Artifact or Approach | Core Contribution |
|---|---|---|
| Multimodal reasoning benchmark | (Fu et al., 2024) | Visual code QA, logic+perception |
| Executable LLM repair/debugging | (Si et al., 31 Jan 2026) | Paired project+bug+test suites, edit drift metrics |
| Block-based test block framework | (Feldmeier et al., 2024, Wang et al., 2021) | UI-integrated test building, grading |
| Search-based test generation | (Deiner et al., 2022) | Deterministic, coverage-maximizing suite synthesis |
| LLM-based creative coding assessment | (Druga et al., 2023) | LLM evaluation in pedagogical contexts |
This technical ecosystem positions ScratchEval as a comprehensive suite for evaluating code understanding and repair—spanning LMMs, automated testing, and educational feedback in block-based languages. Its benchmarks and frameworks establish rigorous, reproducible, and extensible protocols for the study and improvement of visual programming reasoning systems.