- The paper demonstrates that explicit, checklist-based rubric design improves human-AI scoring reliability, with ICCs reaching 0.94 in clear cases.
- The paper finds that 0-shot prompting leads to more consistent scoring compared to 1-shot, particularly for responses at performance extremes.
- The paper reveals that LLM temperature variations have minimal impact on scoring reliability, underscoring rubric precision and prompt strategy as key factors.
Reliable LLM-Assisted Rubric Scoring in Physics: A Technical Analysis
Introduction
The adoption of LLMs such as GPT-4o for automated scoring of STEM constructed responses presents both promise and unresolved challenges. Constructed-response assessment in domains like physics often requires interpreting handwritten work, symbolic expressions, and diagrams, which creates substantial heterogeneity in both student outputs and human rater interpretation. This paper investigates how rubric structure and LLM configuration (prompting format and temperature) affect the reliability and alignment of AI-assisted scoring relative to human raters in undergraduate physics exams ["Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams" (2604.12227)].
Methodology and Experimental Setup
The authors conduct a controlled evaluation using authentic physics exam responses (handwritten, multimodal) collected from a first-year engineering course. Twenty responses were scored in two sequential rounds, each using a distinct rubric design:
- Round 1 utilized a skill-based holistic rubric (Rubric V1), assessing performance across conceptual, mathematical, and problem-solving dimensions but relying on global, qualitative descriptors.
- Round 2 transitioned to a more granular, checklist-based rubric (Rubric V2) operationalizing explicit, part-specific skill indicators.
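To make the contrast with a holistic rubric concrete, a checklist-style rubric can be sketched as a list of part-specific indicators, each worth a fixed number of points. The class name, field names, indicator wording, and point values below are hypothetical and chosen for illustration; they are not taken from Rubric V2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    part: str        # exam sub-part the indicator belongs to (hypothetical label)
    indicator: str   # observable evidence a rater looks for
    points: float    # credit awarded when the indicator is present


# Hypothetical items for illustration only; not the paper's Rubric V2.
RUBRIC = [
    ChecklistItem("a", "identifies the motion as constant-acceleration kinematics", 1.0),
    ChecklistItem("a", "resolves the initial velocity into components", 1.0),
    ChecklistItem("b", "selects and applies an appropriate kinematic equation", 2.0),
    ChecklistItem("b", "reports the final answer with correct units", 1.0),
]


def score_response(checked, rubric=RUBRIC):
    """Sum the points of every indicator a rater marked as present."""
    if len(checked) != len(rubric):
        raise ValueError("one True/False judgement is required per checklist item")
    return sum(item.points for item, hit in zip(rubric, checked) if hit)


# A rater who sees every indicator except the component resolution.
total = score_response([True, False, True, True])
```

Because each indicator is judged separately, partial credit becomes a sum of small yes/no decisions rather than one global impression, which is the property the checklist design relies on.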
Each response was independently scored by four domain-expert instructors and by GPT-4o, with systematic variation of LLM prompting format (0-shot vs. 1-shot) and temperature (0.3, 0.5, 0.7, 1.0). This multifactorial design enabled isolation of the effects of rubric granularity, prompt structure, and sampling randomness on inter-rater agreement.
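The AI-side conditions can be enumerated as a simple grid. This is a minimal sketch that assumes the two prompt formats and four temperatures were fully crossed; the text above does not state this explicitly, and the dictionary keys are illustrative names.

```python
from itertools import product

PROMPT_FORMATS = ("0-shot", "1-shot")
TEMPERATURES = (0.3, 0.5, 0.7, 1.0)


def build_conditions():
    """Return one dict per (prompt format, temperature) pair.

    Assumes a full crossing of the two factors, which is an assumption
    of this sketch rather than a detail confirmed by the source.
    """
    return [
        {"prompt_format": fmt, "temperature": temp}
        for fmt, temp in product(PROMPT_FORMATS, TEMPERATURES)
    ]


conditions = build_conditions()  # 2 formats x 4 temperatures = 8 conditions
```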
Agreement was quantified with intraclass correlation coefficients (ICCs) at the total-score, criterion, and performance-subgroup (extreme vs. mid-level) levels.
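For readers unfamiliar with ICCs, the sketch below computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a responses-by-raters score matrix. Which ICC form the authors used is not stated here, so the choice of ICC(2,1) is an assumption, and the function name and toy data are illustrative only.

```python
from statistics import fmean


def icc2_1(scores):
    """ICC(2,1) for a matrix with one row per response and one column per rater.

    The ICC form is an assumption of this sketch; the source does not
    specify which variant was reported.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a rectangular matrix with at least 2 rows and 2 columns")

    grand = fmean(v for row in scores for v in row)
    row_means = [fmean(row) for row in scores]
    col_means = [fmean(col) for col in zip(*scores)]

    # Two-way ANOVA sums of squares: responses (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


# Toy matrix: 6 responses scored by 4 raters (illustrative values, not the paper's data).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
toy_icc = icc2_1(toy)  # approximately 0.29
```

To compare human-only and human-plus-AI agreement in this framing, the same function would be called once on the human columns alone and once with the AI scores appended as an extra column.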
Core Findings and Numerical Results
Human-AI Agreement: Aggregate Patterns
Across both rounds, GPT-4o's total-score agreement with human raters was on par with human inter-rater reliability—provided rubric criteria were explicit and responses were unambiguous. In Round 1 (holistic rubric, default AI settings), overall ICCs were 0.88 (human-only) and 0.89 (including AI). In Round 2 (checklist rubric, 0-shot AI prompting), ICCs reached 0.94 (human-only) and 0.92 (with AI at temperature 0.3), which represent excellent reliability.
Agreement was highest for extreme-performing responses: ICCs of 0.93–0.94 were observed for high- and low-performing groups, whether comparing human-only or human-AI scores. Conversely, reliability fell sharply for mid-performing responses (ICC ≈ 0.28–0.30 in Round 1), highlighting persistent ambiguity and interpretive challenges in partial-credit assignment. The finer-grained rubric in Round 2 reduced but did not eliminate this effect under 0-shot prompting, whereas 1-shot prompting worsened mid-level disagreement (ICCs ranging from -0.09 to 0.14).
Criterion-Level Analysis
AI-human agreement at the criterion level closely tracked the semantic precision of the rubric dimension:
- Well-defined conceptual or representational items (e.g., kinematic recognition, vector conversion) yielded ICCs of 0.91–0.94.
- Procedural or integrative criteria requiring interpretive judgment (e.g., multi-step problem solving, extraction of relevant information) showed notably lower ICCs (as low as 0.33–0.51 when AI scores were included).
- The lowest consistency was seen for mathematical procedure criteria where working-out was fragmented or error-prone, both for humans and AI.
Rubric, Prompt, and Temperature Interactions
- Rubric granularity dominated reliability outcomes. Transitioning to checklist-style rubrics led to more consistent partial-credit assignment, especially in ambiguous responses, and increased both human and AI scoring agreement.
- Prompting format (0-shot vs. 1-shot) had a secondary but meaningful impact. 0-shot prompting consistently produced higher and less variable agreement than 1-shot prompting, especially for mid- and low-performers. A likely cause is that the model anchored on the single provided exemplar, which typically illustrated a full-mastery solution.
- Temperature had minimal influence on reliability: ICC values with AI varied by less than 0.02 across temperature settings in most conditions, suggesting that prompt and rubric specification are the principal levers for reliable LLM scoring in this regime.
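The temperature observation amounts to a spread check: take the ICC obtained at each temperature and compare the largest and smallest values against a tolerance. The sketch below illustrates that check; the function names are hypothetical and the ICC values are placeholders, not results from the paper.

```python
def icc_spread(icc_by_temperature):
    """Return the gap between the highest and lowest ICC across temperatures."""
    values = list(icc_by_temperature.values())
    if not values:
        raise ValueError("at least one ICC value is required")
    return max(values) - min(values)


def is_temperature_insensitive(icc_by_temperature, tolerance=0.02):
    """True when ICCs differ by less than `tolerance` across temperatures."""
    return icc_spread(icc_by_temperature) < tolerance


# Placeholder ICCs keyed by temperature; illustrative only, not the paper's results.
example = {0.3: 0.92, 0.5: 0.915, 0.7: 0.91, 1.0: 0.92}
stable = is_temperature_insensitive(example)
```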
Implications for AI-Driven Assessment in STEM
Practical Deployment
- Rubric engineering is central. Explicit, analytic, part-specific criteria (informed by cognitive diagnostic modeling) yield measurably higher reliability for both human and AI scoring. Investment in rubric design pays off more than marginal tuning of LLM hyperparameters or prompts.
- Hybrid workflows are essential for partial-credit reasoning. AI models robustly replicate human scoring for clearly correct or incorrect work but propagate human ambiguity when reasoning is partial or evidence is distributed across the response. AI integration should emphasize automated triage, with ambiguous cases flagged for human oversight.
- Prompt engineering should prioritize zero-shot, rubric-specific calibration over limited 1-shot exemplars, unless a diverse spectrum of response types is provided.
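The triage idea in the list above can be sketched as a simple routing rule: send a response to a human when its AI score falls in a mid-performance band, where agreement was weakest, or when repeated AI runs disagree. The function name, band limits, and spread threshold are hypothetical defaults, and using disagreement between repeated runs as a signal is an assumption of this sketch rather than a procedure from the paper.

```python
def needs_human_review(ai_scores, max_score, low=0.3, high=0.7, max_spread=1.0):
    """Decide whether one response should be flagged for human scoring.

    ai_scores:  scores from repeated AI runs on the same response
    max_score:  maximum attainable total score
    low, high:  hypothetical limits of the mid-performance band (score fractions)
    max_spread: hypothetical largest tolerated gap between AI runs
    """
    if not ai_scores or max_score <= 0:
        raise ValueError("need at least one AI score and a positive max_score")

    mean_fraction = sum(ai_scores) / len(ai_scores) / max_score
    spread = max(ai_scores) - min(ai_scores)

    in_mid_band = low <= mean_fraction <= high
    return in_mid_band or spread > max_spread


clear_case = needs_human_review([9, 9.5, 9], max_score=10)   # high score, runs agree
mid_case = needs_human_review([5, 5, 5], max_score=10)       # mid-band score
unstable_case = needs_human_review([9, 7, 9], max_score=10)  # runs disagree
```

In practice the band limits would need to be calibrated against human scores for the specific exam, since the boundary between "clear" and "ambiguous" work depends on the rubric.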
Theoretical Insights and the Role of LLMs
- The findings support a view that LLMs in STEM assessment primarily reflect, rather than transcend, the boundaries of current rubric expressivity and rater consistency. The representational alignment between rubric and student work is more determinative than LLM generation stochasticity.
- The study empirically grounds the claim that reliability issues in AI scoring are not fundamentally new but closely track known interpretive ambiguities in STEM constructed-response assessment. This points to priorities for future research: rubric design automation, calibration of partial-credit anchors, and expansion to other multimodal, domain-rich contexts.
Future Research Trajectories
Potential avenues include:
- Extension of the study design across disciplines with high partial-credit ambiguity (engineering design, advanced mathematics, laboratory-based science).
- Systematic benchmarking of alternative LLM architectures and multimodal vision-LLMs for unconstrained handwritten and diagrammatic inputs.
- Automated rubric construction and skill-mapping using advances in interpretable ML and cognitive modeling.
- Development of explainability protocols for AI scoring, especially around divergent human/AI ratings.
Conclusion
This research provides robust, domain-informed evidence that the primary determinant of reliable LLM-assisted scoring for constructed STEM responses is the granularity and explicitness of rubric design. While LLM configuration (prompt format, temperature) plays a role, it is secondary to the representational clarity imparted by rubrics that align observable student work with explicit skill criteria. The results underscore the need for principled rubric engineering and hybrid assessment strategies to harness the efficiency of LLMs while preserving validity and interpretability in STEM education assessment workflows.