A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

Published 1 Apr 2026 in cs.CY, cs.AI, cs.LG, and cs.SE | (2604.00730v1)

Abstract: Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic, Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.


Authors (3)

Summary

The paper introduces a novel framework that fuses Fuzzy C-Means clustering with CEFR-based ordinal mapping to automate assessment in Scratch.
It leverages data from over two million Scratch projects and nine computational thinking features to identify clear and transitional learner profiles.
The methodology demonstrates high generalizability and provides adaptive curriculum design insights through continuous progress tracking and certainty quantification.

A Fuzzy C-Means and CEFR-Aligned Framework for Automated Programming Skill Assessment in Scratch

Introduction and Motivation

This work presents a comprehensive, data-driven framework for scalable programming skill assessment in block-based learning environments, specifically Scratch. The approach fuses Fuzzy C-Means (FCM) clustering with the Common European Framework of Reference (CEFR) proficiency scale, enabling a pedagogically aligned, ordinal classification of computational thinking (CT) competencies. The primary innovation is mapping unsupervised cluster centroids to CEFR levels (A1-C2) via an $S_j$-ordering criterion, followed by enhanced classification strategies that identify transitional learners, quantify classification certainty, and report continuous progress metrics.

The impetus stems from a convergence of educational needs: scalable yet interpretable skill diagnostics, actionable insights for adaptive curriculum design, support for personalized learning trajectories, and integration within standards-based frameworks such as DigCompEdu. Traditional assessment tools (Dr.Scratch, CTt, Bebras, Hairball, LitterBox) measure discrete CT constructs yet lack holistic, standardized proficiency mappings or nuanced modeling of learner states between categories.

Methodological Framework

Data and Feature Engineering

The empirical base consists of over two million public Scratch projects, automatically mined and assessed via Dr.Scratch, extracting nine ordinal CT features (each scored 0-4) per project: Abstraction, Parallelization, Logic, Synchronization, Flow Control, User Interactivity, Data Representation, Math Operators, and Motion Operators. Projects are stratified into training (80%) and test (20%) sets, yielding robust train-test consistency for external validation.

Ordinal Fuzzy Clustering and CEFR Mapping

The FCM algorithm is used (fuzzification $m=1.5$, $\epsilon=10^{-5}$), producing a soft partition of projects across $k=6$ clusters. This captures skill gradients, including both clear and transitional learner profiles. The novel $S_j$ criterion, defined as the sum of the CT centroid values of cluster $j$, imposes a deterministic ordering on the otherwise unordered FCM clusters, aligning them directly with the ascending CEFR sequence.
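The clustering and ordering steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `fuzzy_c_means` and `map_clusters_to_levels`, the random-membership initialisation, the `max_iter` cap, the reading of $\epsilon$ as a stopping tolerance on membership change, and the toy three-feature data are all assumptions made for the example. Only $m=1.5$, $\epsilon=10^{-5}$, and the definition of $S_j$ as the sum of a centroid's feature values come from the text.

```python
import math
import random


def fuzzy_c_means(points, k, m=1.5, eps=1e-5, max_iter=300, seed=0):
    """Plain-Python Fuzzy C-Means. Returns (centroids, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    power = 2.0 / (m - 1.0)
    # Start from random memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        u.append([v / total for v in row])
    centroids = []
    for _ in range(max_iter):
        # Centroid j is the mean of all points weighted by u_ij ** m.
        centroids = []
        for j in range(k):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centroids.append(
                [sum(w[i] * points[i][d] for i in range(n)) / w_sum for d in range(dim)]
            )
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il) ** (2 / (m - 1)).
        new_u = []
        for p in points:
            dist = [math.dist(p, c) for c in centroids]
            if min(dist) == 0.0:
                # The point sits exactly on a centroid: share membership among such centroids.
                hits = [1.0 if d == 0.0 else 0.0 for d in dist]
                new_u.append([h / sum(hits) for h in hits])
            else:
                new_u.append([1.0 / sum((dj / dl) ** power for dl in dist) for dj in dist])
        # Assumed stopping rule: largest membership change falls below eps.
        shift = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if shift < eps:
            break
    return centroids, u


def map_clusters_to_levels(centroids, levels):
    """Order clusters by S_j (sum of centroid values) and assign levels in ascending order."""
    s = [sum(c) for c in centroids]
    order = sorted(range(len(centroids)), key=lambda j: s[j])
    return {cluster: levels[rank] for rank, cluster in enumerate(order)}


# Toy data (hypothetical): three features scored 0-4, three obvious skill groups.
projects = [
    [0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1],
    [2, 2, 2], [2, 3, 2], [3, 2, 2], [2, 2, 1],
    [4, 4, 4], [4, 3, 4], [3, 4, 4], [4, 4, 3],
]
centroids, u = fuzzy_c_means(projects, k=3)
level_of = map_clusters_to_levels(centroids, ["low", "mid", "high"])
hard = [level_of[max(range(3), key=row.__getitem__)] for row in u]
print(hard)
```

With nine features and $k=6$, the same `map_clusters_to_levels` call would take `["A1", "A2", "B1", "B2", "C1", "C2"]` as its level list. Because the mapping depends only on the sorted $S_j$ values, it is deterministic for a given set of centroids, whatever order the clustering run happens to emit them in.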

Figure 1: Gap statistic for $k = 2$ to $k = 12$ clusters, confirming non-discrete underlying distributions and justifying domain-prioritized $k=6$ CEFR alignment.

The selection of six clusters, rather than the statistically optimal two, is justified by (i) international standardization (CEFR, DigCompEdu), (ii) retention of the granularity needed for transition-state detection, and (iii) statistically significant discrimination across all cluster pairs (Kruskal–Wallis, Mann–Whitney, $p<0.001$).

Enhanced Classification and Certainty Quantification

Beyond single-label assignments, the framework introduces three orthogonal metrics:

Classification Type: "Clear" (primary membership $\geq 0.5$), "Transition" (gap between primary and secondary membership below 0.15), and "Predominant" (all others), allowing fine-grained differentiation relevant to formative assessment.
Continuous Score: Membership-weighted average across ordinal CEFR levels, supporting high-resolution progress tracking ([…]).
Certainty: Shannon entropy-derived metric capturing cluster ambiguity, partitioned as Low (below 0.4), Medium, or High (above 0.7) to control human-in-the-loop escalation.
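The three metrics above can be illustrated with a small sketch that takes one project's membership vector (ordered A1 to C2) and derives all three values. Several details are assumptions, since the summary does not spell them out: the rules are tried in the order listed (Clear first, then Transition), the levels are numbered 1 to 6 for the continuous score, and certainty is taken as one minus the Shannon entropy normalised by $\log k$. The function name `enhanced_classification` and its parameter names are hypothetical.

```python
import math

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def enhanced_classification(memberships, levels=LEVELS, clear_min=0.5,
                            transition_gap=0.15, low_max=0.4, high_min=0.7):
    """Derive type, continuous score and certainty from one membership vector.

    `memberships` holds one degree per level, ordered as in `levels`, summing to 1.
    """
    k = len(memberships)
    ranked = sorted(range(k), key=lambda j: memberships[j], reverse=True)
    primary, secondary = ranked[0], ranked[1]
    gap = memberships[primary] - memberships[secondary]

    # Assumption: rules are evaluated in the order the text lists them.
    if memberships[primary] >= clear_min:
        kind = "Clear"
    elif gap < transition_gap:
        kind = "Transition"
    else:
        kind = "Predominant"

    # Assumption: levels are numbered 1..k, so the score lies in [1, k].
    score = sum((j + 1) * mu for j, mu in enumerate(memberships))

    # Assumption: certainty = 1 - H / log(k), with H the Shannon entropy.
    entropy = -sum(mu * math.log(mu) for mu in memberships if mu > 0.0)
    certainty = 1.0 - entropy / math.log(k)

    if certainty < low_max:
        band = "Low"
    elif certainty > high_min:
        band = "High"
    else:
        band = "Medium"

    return {
        "primary": levels[primary],
        "secondary": levels[secondary],
        "type": kind,
        "score": score,
        "certainty": certainty,
        "band": band,
    }


clear_case = enhanced_classification([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
transition_case = enhanced_classification([0.0, 0.0, 0.46, 0.48, 0.04, 0.02])
predominant_case = enhanced_classification([0.05, 0.05, 0.45, 0.25, 0.10, 0.10])
uniform_case = enhanced_classification([1.0 / 6.0] * 6)
print(clear_case["type"], transition_case["type"], predominant_case["type"])
```

A review policy can then key on `band`: for instance, routing only "Low" certainty projects to an instructor. The thresholds are exposed as parameters so that the escalation rate can be tuned to the available reviewing capacity.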

Results

Model Generalization and Robustness

The model shows a minimal generalization gap: Silhouette scores and Average Certainty remain stable between training and test splits (Silhouette: […]; Certainty: […]), indicating high transferability.

Figure 2: Train-test performance comparison demonstrates generalization stability with negligible degradation across evaluation metrics.

Principal Component Analysis shows substantial overlap between adjacent clusters, confirming the empirical continuum of programming skill acquisition rather than discrete categories.

Figure 3: PCA projection of training set observations colored by FCM cluster, centroids ordered by CEFR level, visualizing natural gradation and overlap.

Centroid Profiles and Diagnostic Bottlenecks

Centroid analysis elucidates clear progression across CEFR levels, with low-level clusters characterized by minimal engagement (A1), intermediate clusters showing partial ability across multiple CT facets (B1, B2), and upper clusters (C1, C2) reflecting consistently high scores.

A pronounced "B2 bottleneck" is observed: only 13.3% of learners fall in this transitional zone, in contrast to the higher density at A1/A2 (39.3%) and C1/C2 (27.8%). This is further substantiated by dimension-specific rank correlations, with Synchronization, Flow Control, and Logic providing the strongest discriminatory power ([…]).
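To make the idea of a dimension-specific rank correlation concrete, the sketch below computes Spearman's $\rho$ between one CT dimension and the assigned level index, using average ranks for ties. The summary does not say which rank coefficient the authors used, so Spearman is an assumption here, and the two toy score lists are invented purely for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: level index (1 = A1 .. 6 = C2) and two 0-4 dimension scores.
level_index = [1, 1, 2, 3, 4, 5, 6, 6]
synchronization = [0, 0, 1, 2, 3, 3, 4, 4]   # rises with level
interactivity = [2, 3, 2, 3, 2, 3, 2, 3]     # roughly flat across levels

rho_sync = spearman_rho(synchronization, level_index)
rho_inter = spearman_rho(interactivity, level_index)
print(round(rho_sync, 3), round(rho_inter, 3))
```

A dimension whose score climbs steadily with the assigned level yields $\rho$ near 1 and therefore separates levels well, while a dimension that plateaus early yields a value near 0.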

Enhanced Classification in Practice

Test set results distribute as 79.1% "Clear," 13.7% "Transition," and 7.3% "Predominant," recovering the full distributional spread of learner states across the progression spectrum.

Figure 4: Test set classification distribution across assigned levels, highlighting transition states, and continuous score breakdown with CEFR markers.

Certainty analysis shows a nearly equal partitioning into Low, Medium, and High bands, which lets human review be scheduled flexibly according to resource constraints and pedagogical risk tolerance.

Figure 5: Certainty analysis by categorical certainty level, including continuous score histogram and box plots by classification type.

Representative Learner Profiles

Visualization of prototypical cases demonstrates the model's discriminative capacity. Example profiles include:

A1 (Clear): Uniformly minimal CT dimensions; membership near 1.0; certainty 0.99.
B2–B1 (Transition): Mixed strengths; balanced memberships (B2 […], B1 […]); certainty […]; continuous score […].
C2 (Clear): All dimensions maxed; membership […]; certainty […].

Figure 6: A1 (Clear) profile: complete beginner with no demonstrated CT competencies.

Figure 7: B2–B1 (Transition) profile: heterogeneous skill signature, substantial ambiguity in level assignment.

Figure 8: C2 (Clear) profile: uniform mastery across all CT dimensions, unambiguous expert classification.

Baseline Comparison

FCM outperforms MiniBatchKMeans and DBSCAN in preserving soft cluster boundaries and capturing transitional learners. MiniBatchKMeans is overly rigid, with inflated APN, while DBSCAN fails to recover the ordinal structure, producing fragmented and noisy outputs.

Implications

Educational Practice and Curriculum Design

The transition-centric framework enables precise diagnosis of systemic curriculum gaps, including empirically validated bottlenecks (B2) rooted in the cognitive load of synchronizing advanced CT constructs. By surfacing transition states and certainty-driven triage, educators can target interventions precisely where learners are most likely to benefit from human mediation or reinforced scaffolding.

The continuous and probabilistic nature of the assessment supports adaptive progression, nuanced feedback, and mastery-based advancement, aligning with emerging trends in competency-based education. These findings inform strategic curriculum sequencing, with emphasis on Synchronization, Logic, and Data Representation as progression anchors, and on constructs that plateau early (e.g., User Interactivity) as foundational.

Toward Standardized, Scalable Credentialing

Mapping to CEFR enables direct integration within international credentialing systems and alignment with DigCompEdu. Practical instantiations include blockchain-backed digital badges, standardized certification exams, and institutional analytics for workforce planning.

Potential Extensions and Limitations

Extensions include adaptation to text-based languages (e.g., Python) via language-agnostic CT metrics or AST-based representations, longitudinal tracking for progression modeling, multimodal assessment incorporating code quality, debugging, and collaboration, and validation across varied cultural or linguistic contexts. Limitations stem from domain specificity (Scratch), lack of creativity/elegance instrumentation, and cross-cultural generalizability.

Conclusion

This study establishes that FCM-based fuzzy ordinal clustering, mapped to the CEFR standard, constitutes a robust, interpretable, and scalable solution for automated programming skill assessment in Scratch. The methodology bridges categorical and continuous assessment, preserves essential information about learners in transition, and quantifies the epistemic uncertainty inherent in automated diagnostics. These features make the framework immediately relevant for large-scale learning analytics, adaptive educational technology, and competency-driven curriculum reform. The approach is generalizable to additional domains and directly supports institutional credentialing and personalized pedagogical intervention.
