Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

Published 8 Apr 2026 in cs.SE and cs.AI | (2604.07304v1)

Abstract: LLMs challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces a hybrid Socratic framework that integrates deterministic code analysis with generative dialogue to enhance code understanding assessment.
It categorizes assessment architectures into rule-based, LLM-based, and hybrid systems, highlighting trade-offs between consistency and adaptability.
Empirical results, including a 43% improvement in programming knowledge, demonstrate the framework's potential despite challenges like LLM hallucinations.

Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

Systematic Review Methodology and Search Strategy

The paper conducts a saturation-based scoping review focused on conversational agents within programming assessment systems. Five search string variants, based on the PICOC framework, were constructed to target literature across multiple technical platforms and educational contexts (Figure 1).

Figure 1: Keyword groups and query template used to derive the five search string variants.

A PRISMA-guided workflow, adapted for the rapid evolution and small volume of the domain, was used to systematically identify, screen, and include relevant studies. The review combined multi-source searches, iterative screening, and snowballing to maximize coverage, yielding a final corpus categorized by architecture and pedagogical method (Figure 2).

Figure 2: PRISMA-style flow diagram of identification, screening, eligibility, and inclusion.

Categorization of Conversational Assessment Architectures

Three principal architectures emerged—rule-based/template-driven, LLM-based, and hybrid systems:

Rule-Based/Template-Driven: Utilize deterministic algorithms and handcrafted templates, leveraging static and dynamic code analysis. These methods show strong consistency and auditability but lack adaptability and suffer from limited coverage for novel code patterns.
LLM-Based: Employ generative models (e.g., GPT, Llama, Mistral) to generate contextually appropriate, multi-turn Socratic dialogues. While these enable broader code comprehension and adaptive questioning, they are prone to hallucinations, inconsistent grading, and solution-leakage.
Hybrid Systems: Integrate deterministic program analysis with generative conversational layers, balancing reliability and adaptability. Hybrid systems combine code-grounded evidence with responsive dialogue and guardrails, appearing most suitable for scalable and transparent assessment scenarios.

Benefits, Limitations, and Pedagogical Implications

Empirical evidence demonstrates notable gains: Socratic Author achieved a 43% improvement in programming knowledge, ChatDAC increased post-test scores, and tiered hint engagement correlated positively with exam performance. These agents scale personalized feedback and reduce instructor workload in large cohorts. However, technical limitations such as LLM hallucinations and behavioral risks—student over-reliance, gaming, and superficial engagement—impose constraints on assessment reliability.

A key pedagogical insight is the persistent gap between code-writing and code-understanding competence; students frequently excel on structural questions (>80%) but underperform (<50%) on dynamic execution and tracing tasks. Effective conversational assessment thus requires probing deeper conceptual understanding aligned with explicit reasoning targets.

Hybrid Socratic Framework: Design and Implementation

Deriving from the review, the Hybrid Socratic Framework combines static/dynamic code analysis with a dual-agent conversational layer (Instructor and Verifier), knowledge-tracking, scaffolded questioning, and Socratic guardrails (Figure 3).

Figure 3: Hybrid Socratic Framework for chatbot-based assessment in APASs.

Essential Components

Code Analysis Engine: Extracts deterministic program facts for prompt grounding.
State Space and Knowledge Tracker: Monitors student misconceptions and knowledge gaps across dialogue turns.
Dual-Agent Core: Instructor generates Socratic, program-specific questions; Verifier evaluates explanations against reference states.
Assessment Engine: Integrates evidence from code correctness and explanation quality.
Socratic Guardrails: Constrains agent behavior to prevent solution-giving and off-topic responses.

Integrity Safeguards

To mitigate risks arising from LLM-generated explanations, the framework integrates proctored deployment options, randomized runtime trace questions, stepwise reasoning tied to execution states, and adaptive follow-ups. This shifts assessment from generic explanation to code- and execution-grounded reasoning.

Agent Prompting and Assessment Strategy

Agents operate under strict role-specific prompts adapted from established templates, separating question-generation from explanation evaluation. A two-tier assessment strategy is used: Tier 1 (selection via scenario-based MCQs) ensures initial conceptual engagement, and Tier 2 (open-ended explanation) measures depth of understanding, with scoring weighted toward explanation quality.

Figure 4: Proof of concept, Tier 1: scenario-based multiple-choice question generated by the Instructor Agent.

Figure 5: Proof of concept, Tier 2: open-ended explanation graded by the Verifier Agent.

Discussion and Practical/Theoretical Implications

The results imply a necessary evolution in APASs—from functional artifact assessment to verification of reasoning and explanation. Effective adoption will depend on transparent role definition, prompt policies, scoring thresholds, and clear formative/summative separation. Ethical considerations require auditability, appeal mechanisms, and ongoing validation for fairness and consistency—especially in high-stakes grading. Digital sustainability needs vendor-independent model hosting and reproducible prompt/version management.

Hybrid architectures both address and create new complexities, balancing coverage, adaptability, and operational auditability while introducing infrastructural burden. The applicability of conversational assessment extends to program comprehension, algorithmic reasoning, and authorship verification, anchoring assessment in observable code behavior rather than unconstrained model interpretation.

Threats to Validity

The rapid evolution of LLMs and conversational tools presents ongoing threats to methodological validity. Unvalidated framework parameters and reliance on untested prototype deployments limit generalizability. The synthesis should be interpreted as provisional guidance pending empirical classroom validation.

Conclusion

Conversational assessment, grounded in deterministic program analysis and scaffolded Socratic dialogue, constitutes a promising complementary layer for verifying code understanding in the era of ubiquitous LLMs. The Hybrid Socratic Framework offers a balanced approach, combining the strengths of deterministic evidence and adaptive dialogue. Sustained adoption requires empirical validation, robust knowledge-tracking, multimodal extensions, and careful integration of formative/summative practices to support privacy, reliability, and fairness.

Future Perspectives

Empirical classroom deployments and human-machine comparison studies will be critical in validating framework reliability and acceptance. Development of advanced knowledge-tracing and multimodal assessment modalities will enhance robustness. Systematic investigation into deployment modes, privacy/fairness impacts, and their behavioral consequences is essential for principled integration of conversational agents in programming assessment.

Markdown Report Issue