- The paper shows that curated GenAI tutoring yields 98.9% accuracy and high relevance in upskilling SE students in domain and modeling tasks.
- The study employs strict prompt engineering and a tailored knowledge base to ensure precise response quality and effective learning outcomes.
- The research identifies actionable design practices for scaling GenAI support in SE education, while noting low affective supportiveness that requires improvement.
GenAI-Supported Upskilling in Software Engineering Education
Study Context and Motivation
The paper "From Domain Understanding to Design Readiness: a playbook for GenAI-supported learning in Software Engineering" (2604.00120) addresses a critical challenge in software engineering (SE) pedagogy: enabling rapid upskilling in essential but diverse supporting knowledge areas, including domain understanding (here, cryptocurrency finance) and modeling methodologies (notably Domain-Driven Design, or DDD). Traditional lecture-based instruction faces well-documented scalability and personalization limitations, especially in the context of team-based, design-intensive SE courses. While LLM tutoring systems have proliferated for programming-related support, their integration and impact in modeling-centric and domain-learning settings remain inadequately evaluated.
Experimental Design and Methodological Rigor
The authors executed a two-week intervention with 29 master's-level SE students, leveraging a customized ChatGPT-based (GPT-3.5) tutor grounded strictly in a curated course knowledge base (slides, project specifications, and exemplars). Students, most of whom had minimal finance or DDD experience, engaged with this GenAI tutor through a structured sequence of individually completed prompts, reflection reports, and rigorous knowledge checks. Data collection spanned demographics, pre/post self-efficacy in AI and domain knowledge, and comprehensive logging of all student–tutor interactions. A stratified random sample of 174 prompt–answer pairs (34.5% of those logged) was independently scored using a five-dimension rubric: accuracy, relevance, pedagogical value, cognitive load, and supportiveness. Inter-rater reliability was established by calibrating TA raters on shared subsets of responses.
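The summary does not name the agreement statistic used; a common choice for checking calibrated raters is Cohen's kappa, sketched below on a hypothetical shared calibration subset (all ratings are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two TAs score the same ten answers on a 1-5 rubric dimension before
# the remaining sample is split between them.
ta_1 = [5, 5, 4, 5, 3, 4, 5, 5, 2, 4]
ta_2 = [5, 4, 4, 5, 3, 4, 5, 5, 3, 4]
print(f"kappa = {cohens_kappa(ta_1, ta_2):.2f}")  # ~0.70 here
```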
Results: Answer Quality, Pedagogical Efficacy, and Supportiveness
The core findings demonstrate that, when carefully constrained by both a curated knowledge base and explicit prompt engineering, GenAI tutors can achieve extremely high answer quality for upskilling in SE domain and modeling tasks. Specifically (the simple aggregation behind these figures is sketched after the list):
- Accuracy: The GenAI responses averaged 98.9% accuracy, with zero major hallucinations and only two minor inaccuracies (contextual mismatch or slight terminological imprecision), strongly countering prevailing narratives about the unreliability of LLM-based tutors in technical domains.
- Relevance: Mean relevance across prompts was 92.2%; off-topic excursions correlated strongly with gaps in the background knowledge base or with insufficient prompt specificity.
- Pedagogical Value: Clear structuring, depth, and inclusion of relevant examples drove a mean pedagogical value of 89.4%, sufficient for nontrivial conceptual upskilling.
- Cognitive Load: Responses were, on average, accessible and concise (82.8%); cognitive overload generally arose from excessive verbosity, redundant phrasing, or unchecked use of advanced terminology.
- Supportiveness: The notable outlier, supportiveness, averaged just 37.8%. The default GenAI persona delivered largely neutral, information-focused responses, rarely adding motivational language or affective scaffolding unless explicitly prompted to do so.
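Dimension means like those above are, in the usual reading, simple averages of per-pair rubric scores over the rated sample; a minimal sketch of that aggregation (records and scores are hypothetical):

```python
# Each rated prompt-answer pair carries one score per rubric dimension,
# normalized here to [0, 1]; the published figures are means over the sample.
scored_pairs = [
    {"accuracy": 1.0, "relevance": 0.9, "pedagogical_value": 0.9,
     "cognitive_load": 0.8, "supportiveness": 0.4},
    {"accuracy": 1.0, "relevance": 1.0, "pedagogical_value": 0.8,
     "cognitive_load": 0.9, "supportiveness": 0.3},
    # ... remaining rated pairs
]

for dim in scored_pairs[0]:
    mean = sum(pair[dim] for pair in scored_pairs) / len(scored_pairs)
    print(f"{dim:>18}: {mean:.1%}")
```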
Students also showed statistically significant gains in self-reported confidence across all GenAI-relevant self-efficacy measures (effect sizes up to d_z ≈ 1.44), suggesting substantial practical impact on their domain and modeling readiness.
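For reference, d_z is the standard paired effect size: the mean pre-to-post difference divided by the standard deviation of those differences. A minimal sketch with hypothetical Likert-scale self-efficacy ratings:

```python
import statistics

def cohens_dz(pre, post):
    """Paired effect size: mean of per-student differences / SD of differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Hypothetical pre/post self-efficacy ratings (1-5 Likert) for eight students.
pre = [2, 3, 2, 4, 3, 2, 3, 3]
post = [4, 4, 3, 4, 4, 3, 3, 4]
print(f"d_z = {cohens_dz(pre, post):.2f}")  # ~1.37 for these values
```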
Analysis of Failure Modes and Design Recommendations
Detailed analysis of recorded failure modes led to the consolidation of seventeen actionable teaching and course design practices for maximizing the efficacy of GenAI-supported upskilling:
- Prompt Engineering: Explicitly constraining acceptable granularity, verbosity, and the number of examples; banning unexplained jargon; curating context-fitting guardrail exemplars; and mapping every prompt precisely to course learning outcomes (a prompt-template sketch follows this list).
- AI Configuration: Grounding the model in an exhaustive, context-specific knowledge base proved indispensable; coverage gaps directly caused the observed relevance and accuracy dips (a minimal grounding sketch appears after the scaling note below).
- Workflow and Assessment: Mandatory, concise log submissions, careful knowledge-check rubrics, and provision for varying question difficulty are critical both for instructional accountability and assessing true learning gains.
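As referenced in the Prompt Engineering item above, constraints of this kind can be encoded in a reusable system-prompt template. The wording and parameters below are illustrative; the paper's actual prompts are not reproduced here:

```python
# Illustrative only: one way the study's prompt constraints could be
# encoded as a reusable system-prompt template.
GUARDRAIL_TEMPLATE = """\
You are a course tutor for {course}. Answer ONLY from the provided course
materials; if they do not cover the question, say so explicitly.

Constraints:
- Stay at {granularity} granularity and within {max_words} words.
- Give at most {max_examples} worked examples per answer.
- Define every technical term on first use; never leave jargon unexplained.
- End by relating your answer to this learning outcome: {learning_outcome}.
"""

system_prompt = GUARDRAIL_TEMPLATE.format(
    course="a master's SE course (cryptocurrency finance / DDD)",
    granularity="introductory",
    max_words=250,
    max_examples=2,
    learning_outcome="model a bounded context for an exchange domain",
)
print(system_prompt)
```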
Scaling the approach involves TA calibration for rubric-based grading and minimal but sufficient student reporting requirements. The paper offers an adoption path for larger courses and different SE domains.
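The grounding mechanism itself is not detailed in this summary; a common realization is retrieval over the curated materials, with only the retrieved snippets passed to the model. The keyword-overlap scorer below is a naive stand-in for whatever retriever (e.g., embeddings) a real deployment would use:

```python
# Naive grounding sketch: retrieve the best-matching curated snippets and
# inject ONLY those into the tutor's context.
KNOWLEDGE_BASE = [
    ("slides/ddd-01", "A bounded context delimits where a domain model applies."),
    ("spec/project", "The course project models a cryptocurrency exchange."),
    ("exemplar/aggregates", "An aggregate groups entities behind a single root."),
]

def retrieve(question: str, k: int = 2):
    """Return up to k (doc_id, text) snippets sharing terms with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(
        ((len(q_terms & set(text.lower().split())), doc_id, text)
         for doc_id, text in KNOWLEDGE_BASE),
        reverse=True,
    )
    return [(doc_id, text) for score, doc_id, text in scored[:k] if score > 0]

# Questions that retrieve nothing should trigger a refusal; this is exactly
# how knowledge-base coverage gaps surface as relevance and accuracy dips.
print(retrieve("What is a bounded context in DDD?"))
```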
Theoretical and Practical Implications
The study provides clear evidence that, under strong instructional-design controls and knowledge-base curation, GenAI tutoring can directly complement SE instruction in modeling and domain learning, domains where conventional LLM support is often considered unreliable. The main theoretical implication is that the critical determinant of GenAI tutoring reliability is the granularity of grounding and constraint in both prompt engineering and the background knowledge base, not merely model selection or interaction volume.
Practically, this supports a shift toward structured, prompt-driven AI upskilling workflows co-defined with course objectives, short accountability cycles, and a focus on rapid student readiness for collaborative design tasks. Furthermore, the low supportiveness observed signals the need for more active tuning and persona engineering to fulfill the affective and engagement dimensions of effective tutoring, as identified in cognitive engagement frameworks (e.g., ICAP).
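Persona tuning of the kind suggested here can start as a lightweight addendum to the tutor's system prompt; the wording below is hypothetical:

```python
# Hypothetical persona addendum targeting the supportiveness gap: the study
# found affective scaffolding appeared only when explicitly prompted for.
SUPPORTIVE_PERSONA = """\
Tone requirements:
- Open by acknowledging what the student already got right.
- Frame mistakes as expected steps in learning, never as failures.
- Close with one concrete, encouraging next step.
"""

# Would be appended to the tutor's base instructions, e.g.:
# system_prompt = GUARDRAIL_TEMPLATE.format(...) + SUPPORTIVE_PERSONA
```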
Future Directions
The study's contextual limits (single institution, one specific course, GPT-3.5, prescribed prompt formats) invite replication across varying class sizes, SE topics, GenAI model versions, and more open-ended conversational structures. Automated, rubric-aligned triage for conversational quality assurance, sketched below, could further reduce instructor overhead and strengthen student trust. Closing the observed supportiveness gap, especially via persona prompt engineering, remains an open problem for maximizing the holistic efficacy of GenAI tutoring systems in technical education contexts.
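Such triage could begin as a simple threshold filter over rubric-style scores, escalating only low-scoring exchanges to instructors; the threshold values below are illustrative, not taken from the paper:

```python
# Hypothetical rubric-aligned triage: auto-pass exchanges that clear
# per-dimension score floors and queue the rest for instructor review.
THRESHOLDS = {
    "accuracy": 0.95, "relevance": 0.85, "pedagogical_value": 0.80,
    "cognitive_load": 0.75, "supportiveness": 0.50,
}

def needs_review(scores: dict) -> list[str]:
    """Return the rubric dimensions on which an exchange falls short."""
    return [dim for dim, floor in THRESHOLDS.items()
            if scores.get(dim, 0.0) < floor]

exchange = {"accuracy": 0.99, "relevance": 0.91, "pedagogical_value": 0.88,
            "cognitive_load": 0.82, "supportiveness": 0.35}
print(needs_review(exchange) or "auto-pass")  # -> ['supportiveness']
```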
Conclusion
This work establishes that GenAI-augmented, knowledge-base-grounded workflows support rapid acquisition of domain and modeling expertise in SE education, with measurable gains in self-efficacy and high reliability under carefully engineered instructional constraints. The playbook of practices distilled herein constitutes a practical foundation for further research and systems development in scalable, domain-adapted AI tutoring for design-focused engineering domains.