Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

Published 18 Jun 2026 in cs.AI | (2606.20264v1)

Abstract: Student-generated drawings are widely used in science education to assess learners' conceptual understanding in modeling-based tasks aligned with the Next Generation Science Standards (NGSS). However, scoring such drawings requires expert human judgment to interpret complex visual representations, making large-scale assessment costly to implement and sustain in classroom settings. In this work, we study automated scoring of student-generated scientific drawings using a vision-based model. We evaluate a Vision Transformer (ViT) with parameter-efficient adaptation and propose a confidence-aware scoring framework that derives response-level confidence from test-time predictive distributions. This confidence signal enables selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items show that the proposed approach improves scoring reliability while supporting a practical trade-off between automated coverage and scoring risk, highlighting the value of confidence-aware methods for trustworthy educational assessment.


Authors (6)

Summary

The paper introduces a novel confidence-aware paradigm that uses a Vision Transformer backbone with LoRA to selectively score high-confidence student-drawn scientific models.
It applies semantic-preserving perturbations at test time to measure response-level confidence, which is then used to balance automated scoring against manual review.
Experimental results show improved average accuracy (78.9%), Cohen’s kappa (0.760), and F1 score (0.727) compared to baselines, supporting reliable classroom application.

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models: An Expert Review

Introduction

The paper "Confidence-Aware Automated Assessment of Student-Drawn Scientific Models" (2606.20264) addresses the challenge of scalable, reliable scoring of student-generated scientific drawings in educational contexts aligned with the Next Generation Science Standards (NGSS). Drawing-based representations are central to evaluating scientific modeling practice, yet their complex, open-ended nature has traditionally required expert evaluation, which makes high-throughput classroom assessment costly and logistically difficult. The authors propose a confidence-aware paradigm built on a Vision Transformer (ViT) backbone augmented with parameter-efficient Low-Rank Adaptation (LoRA). Central to the approach is response-level confidence estimation from test-time predictive distributions, which enables principled selective automation: high-confidence responses are scored automatically, whereas low-confidence responses are deferred for manual review. The design therefore targets not only prediction accuracy but also the trustworthiness and interpretability requirements fundamental to educational decision-making.

Methodology

The core methodology uses a pretrained ViT model, fine-tuned with LoRA for efficient domain adaptation. The model is trained to classify student-drawn images into rubric-based proficiency levels: Beginning, Developing, and Proficient. Test-time confidence estimation applies semantic-preserving perturbations (e.g., crop, rotation) to each response and aggregates the resulting predictions into a response-level predictive distribution. The confidence score $k(x)$ is the probability mass that this aggregated distribution assigns to the argmax proficiency label. Selective automated scoring thresholds $k(x)$: responses above the threshold are scored automatically and the rest are deferred. In addition, selective trust refines the aggregation by filtering perturbations according to a decisiveness metric, which provides robustness against superficial variability in students' visual representations. Together, these mechanisms enable a practical trade-off between coverage and reliability, an essential property for operational classroom deployment.
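The aggregation and thresholding steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the mean-based aggregation, the use of the top-1 minus top-2 probability margin as the decisiveness metric, and the threshold values are all assumptions, since the review does not specify them. The per-view probabilities are made-up numbers standing in for the model's softmax outputs on perturbed copies of one drawing.

```python
from typing import List, Optional, Tuple

LABELS = ["Beginning", "Developing", "Proficient"]


def aggregate(views: List[List[float]], min_decisiveness: float = 0.0) -> List[float]:
    """Average per-view class probabilities, keeping only decisive views.

    Decisiveness is taken here as the top-1 minus top-2 probability of a view.
    This is an assumed stand-in; the paper's exact metric is not given here.
    """
    kept = []
    for probs in views:
        ranked = sorted(probs, reverse=True)
        if ranked[0] - ranked[1] >= min_decisiveness:
            kept.append(probs)
    if not kept:  # if the filter removes every view, fall back to all of them
        kept = views
    n_classes = len(kept[0])
    return [sum(p[c] for p in kept) / len(kept) for c in range(n_classes)]


def score(
    views: List[List[float]],
    threshold: float = 0.7,
    min_decisiveness: float = 0.0,
) -> Tuple[Optional[str], float]:
    """Return (label or None if deferred, confidence k)."""
    dist = aggregate(views, min_decisiveness)
    k = max(dist)  # probability mass on the argmax label
    label = LABELS[dist.index(k)]
    return (label if k >= threshold else None), k


# Hypothetical softmax outputs for three perturbed views of one drawing.
views = [
    [0.10, 0.20, 0.70],
    [0.20, 0.20, 0.60],
    [0.34, 0.33, 0.33],  # nearly uniform, i.e. an indecisive view
]

print(score(views, threshold=0.6))                        # uniform aggregation
print(score(views, threshold=0.6, min_decisiveness=0.2))  # selective aggregation
```

With these illustrative numbers, averaging all three views leaves the confidence below the 0.6 threshold and the response is deferred, while dropping the indecisive third view raises the confidence enough to score the response automatically. How the threshold is chosen in practice depends on how much scoring risk is acceptable for a given level of automated coverage.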

Experimental Results

Experiments were conducted on a dataset comprising student drawings from six middle school science modeling items, each mapped to ordered proficiency levels and independently annotated by domain experts. The evaluation benchmarks four approaches: ViT (Frozen), ViT+LoRA, CA-Uniform, and the proposed CA-Selective. Across classification metrics—including accuracy, Cohen’s kappa, precision, recall, and F1 score—the CA-Selective method consistently yielded the highest or near-highest values:

Average accuracy: CA-Selective achieved 0.789 versus 0.766 for ViT+LoRA and 0.289 for ViT (Frozen).
Average Cohen's kappa: CA-Selective reached 0.760, exceeding ViT+LoRA (0.708) and ViT (Frozen) (-0.016); a kappa near zero indicates agreement no better than chance.
F1 score: CA-Selective obtained 0.727, indicating strong alignment with expert scoring.
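For readers less familiar with Cohen's kappa, the sketch below shows how the statistic corrects raw agreement for chance, which is why a near-zero or slightly negative value such as the one reported for ViT (Frozen) signals agreement no better than chance. It assumes the unweighted form of kappa; the review does not say whether the paper uses a weighted variant for the ordered proficiency levels. The label sequences are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters labelled independently,
    # each following their own label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1.0 - p_e)


# Invented expert labels and two invented sets of model predictions.
expert = ["Beginning", "Beginning", "Developing", "Developing", "Proficient", "Proficient"]
model = ["Beginning", "Beginning", "Developing", "Proficient", "Proficient", "Proficient"]
constant = ["Beginning"] * 6  # a model that always predicts the same level

print(cohen_kappa(expert, model))
print(cohen_kappa(expert, constant))
```

In this toy example the constant predictor still matches the expert on a third of the items, yet its kappa is zero because that level of agreement is exactly what chance would produce.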

Model complexity remains low: LoRA adaptation adds only 0.6M parameters on top of the 86.4M-parameter ViT backbone. Confidence-aware inference costs more than a single forward pass because each response is evaluated under multiple augmented views, but it remains orders of magnitude faster than multimodal LLM scoring (e.g., Qwen3-VL-8B-Instruct). The vision-language baseline performed less reliably and with substantially greater latency, underscoring the competitiveness of task-adapted vision-only architectures for rubric-aligned drawing assessment under operational constraints.
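To give a sense of where a figure like 0.6M added parameters can come from, the arithmetic below counts LoRA parameters for one plausible configuration. The rank, the choice of adapted projections, and the ViT-Base dimensions are assumptions chosen for illustration; the review does not report the configuration the authors used.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed ViT-Base-like dimensions and an assumed LoRA setup (not from the paper).
hidden_size = 768          # width of each attention projection
num_layers = 12            # transformer blocks
rank = 16                  # assumed LoRA rank
adapted_per_layer = 2      # assumed: query and value projections only

added = num_layers * adapted_per_layer * lora_params(hidden_size, hidden_size, rank)
backbone = 86_400_000      # backbone size quoted in the text

print(f"added parameters: {added:,}")
print(f"fraction of backbone: {added / backbone:.2%}")
```

Under these assumptions the adapter adds 589,824 parameters, which rounds to the 0.6M quoted above and is well under one percent of the backbone.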

Confidence Analysis

The response-level confidence score $k(x)$ demonstrated substantial practical utility, with a positive correlation ($r = 0.649$, $p < 0.01$) between mean confidence and predictive accuracy across proficiency levels. This empirical alignment supports $k(x)$ as a useful proxy for the trustworthiness of automated scores, particularly for visually simple ('Beginning') responses. The framework thus supplies educators with an actionable signal for triaging student work: scores with high $k(x)$ can inform instructional decisions directly, while low-confidence cases highlight the need for expert review. This is a meaningful advance for responsible deployment in real-world classroom settings, where visual ambiguity, representational variability, and pedagogical stakes demand nuanced automation.
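The triage idea can be made concrete by sweeping a confidence threshold and recording, at each setting, how many responses are scored automatically (coverage) and how accurate those automatic scores are. The sketch below uses invented confidence values and correctness flags purely for illustration; the function name and thresholds are hypothetical and none of the numbers come from the paper.

```python
from typing import List, Tuple


def coverage_and_accuracy(
    confidences: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Coverage and accuracy over responses whose confidence meets the threshold."""
    kept = [ok for k, ok in zip(confidences, correct) if k >= threshold]
    coverage = len(kept) / len(confidences)
    # Accuracy is undefined when nothing is scored automatically.
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Invented confidence scores k(x) and whether the automatic score matched the expert.
confidences = [0.95, 0.90, 0.80, 0.65, 0.60, 0.50]
correct = [True, True, True, False, True, False]

for threshold in (0.0, 0.6, 0.7):
    cov, acc = coverage_and_accuracy(confidences, correct, threshold)
    print(f"threshold={threshold:.1f}  coverage={cov:.2f}  accuracy={acc:.2f}")
```

In this toy data, raising the threshold lowers coverage while raising accuracy on the automatically scored subset, which is the pattern one would expect when confidence and correctness are positively related.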

Implications and Future Directions

The integration of test-time confidence estimation and selective aggregation offers a promising template for trustworthy automated scoring in educational contexts. Practically, this enables scalable classroom assessment with reduced reliance on manual scoring, improved reliability, and actionable interpretability. Theoretically, the method positions confidence estimation as a foundational element of model behavior auditing, relevant not only in education but across domains requiring nuanced human-AI collaboration.

Potential future directions include:

Generalization across populations: Extending evaluation to diverse student cohorts and curricular contexts to rigorously test robustness.
Multimodal integration: Enhancing models with written explanations or other modalities for finer-grained assessment and feedback.
Fine-grained feedback: Leveraging intermediate attention or interpretability mechanisms for actionable formative feedback.
Calibration studies: Investigating the relationship between confidence signals and actual annotation uncertainty, possibly incorporating human-in-the-loop calibration.

Such advancements would further align automated assessment with instructional goals, increase trust and transparency, and expand applicability to broader learning environments.

Conclusion

This paper presents a technically rigorous and operationally viable framework for confidence-aware automated scoring of student scientific drawings using ViT architectures with LoRA adaptation. By leveraging response-level confidence estimation and selective aggregation, the system achieves improved agreement with expert annotations while supplying a clear, interpretable signal for end-user trust calibration. The research advances both the practice and theory of automated educational assessment, with immediate application potential and substantial room for future refinement and extension.
