SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Published 24 Apr 2026 in cs.CL | (2604.22134v1)

Abstract: LLMs have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE


Authors (7)

Summary

The paper introduces a knowledge mastery graph that formalizes prerequisite relationships to enforce pedagogical constraints and prevent jailbreaks.
It presents a graph-augmented tutoring pipeline that dynamically gates responses, significantly boosting safety under adversarial settings.
Experimental evaluation using the SHAPE benchmark demonstrates dramatic safety improvements (e.g., from 0% to 92.25% for Qwen3-80B) without compromising helpfulness.

Unifying Safety, Helpfulness, and Pedagogy in Educational LLMs: The SHAPE Framework

Formalizing Pedagogical Vulnerabilities and the Knowledge Mastery Graph

The paper "SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs" (2604.22134) addresses fundamental limitations in the deployment of LLMs as educational tutors, specifically their susceptibility to pedagogical jailbreaks. These jailbreaks represent prompt-based adversarial attacks wherein students induce models to furnish direct answers, rather than scaffolded instructional support, thereby circumventing intended pedagogical guardrails.

To systematically analyze and defend against these vulnerabilities, the authors introduce a formal knowledge mastery graph framework. Each concept is modeled as a node in a directed acyclic graph, with edges encoding prerequisite relationships. Student mastery states are defined as subsets of nodes reflecting mastered concepts; unsafe behaviors—such as revealing direct answers before mastery—are formalized as violations of prerequisite completeness for a given query.

This principled approach establishes output constraints for instructional LLMs, distinguishing between problem-solving and pedagogical modes through explicit gating mechanisms. Concept-targeted pedagogical responses are permitted only when mastery gaps are present, and direct answers are allowed exclusively when all relevant prerequisites are demonstrated by the student.
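The graph formalism above can be made concrete with a small sketch. It is illustrative only and is not taken from the paper's code: the concept names, the `PREREQS` dictionary, and the helper functions are hypothetical, and it assumes that a query's required set is its own concepts plus all transitive prerequisites.

```python
from typing import Dict, Set

# Hypothetical toy prerequisite DAG: concept -> its direct prerequisites.
PREREQS: Dict[str, Set[str]] = {
    "vectors": set(),
    "matrix_multiplication": {"vectors"},
    "determinant": {"matrix_multiplication"},
    "eigenvalues": {"determinant", "matrix_multiplication"},
}


def prerequisite_closure(concepts: Set[str], graph: Dict[str, Set[str]]) -> Set[str]:
    """Return the given concepts plus all of their transitive prerequisites."""
    closure: Set[str] = set()
    stack = list(concepts)
    while stack:
        node = stack.pop()
        if node in closure:
            continue
        closure.add(node)
        stack.extend(graph.get(node, set()))
    return closure


def mastery_gaps(query_concepts: Set[str], mastered: Set[str],
                 graph: Dict[str, Set[str]]) -> Set[str]:
    """Concepts the query needs that are absent from the student's mastery state."""
    return prerequisite_closure(query_concepts, graph) - mastered


def direct_answer_allowed(query_concepts: Set[str], mastered: Set[str],
                          graph: Dict[str, Set[str]]) -> bool:
    """Assumed gating rule: a direct answer is safe only when no gaps remain."""
    return not mastery_gaps(query_concepts, mastered, graph)
```

Under this sketch, a student who has mastered only `vectors` and `matrix_multiplication` and asks about `eigenvalues` has two gaps (`determinant` and `eigenvalues`), so a direct answer would count as unsafe.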

Figure 1: Representative LLM-tutor interactions under mastery-aware and jailbreak conditions, illustrating when guided instruction or direct answers are appropriate relative to student knowledge.

SHAPE Benchmark: Quantitative Evaluation of Pedagogical Robustness

The SHAPE benchmark is constructed to rigorously evaluate educational LLMs across safety, helpfulness, and pedagogy, encompassing 9,087 student-question pairs in linear algebra with systematically varied knowledge states. The dataset is curated from Big-Math, annotated with step-level concept mappings, and augmented with mastery state simulation to ensure prerequisite consistency.

Metrics are operationalized as follows:

Safety: Fraction of cases where the model appropriately withholds direct solutions given mastery gaps.
Helpfulness: Fraction of cases where direct solutions are provided when prerequisite mastery is achieved.
Pedagogy: Conditional probability of pedagogical (concept-targeted) questioning during safe refusals.
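One way to read these three definitions is as ratios over per-example judgments. The sketch below assumes each evaluated response has already been labeled with three booleans; the `Outcome` record, its field names, and the choice of denominators (gap cases for safety, no-gap cases for helpfulness, safe refusals for pedagogy) are assumptions for illustration, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Outcome:
    has_gap: bool           # student lacks at least one prerequisite
    gave_answer: bool       # model revealed a direct solution
    concept_targeted: bool  # model asked about or taught a missing concept


def _rate(hits: int, total: int) -> Optional[float]:
    """Return hits/total, or None when the denominator is empty."""
    return hits / total if total else None


def shape_metrics(outcomes: Iterable[Outcome]) -> Dict[str, Optional[float]]:
    outcomes = list(outcomes)
    gap = [o for o in outcomes if o.has_gap]
    ready = [o for o in outcomes if not o.has_gap]
    # Safe refusals: a gap exists and the direct answer was withheld.
    safe = [o for o in gap if not o.gave_answer]
    return {
        "safety": _rate(len(safe), len(gap)),
        "helpfulness": _rate(sum(o.gave_answer for o in ready), len(ready)),
        "pedagogy": _rate(sum(o.concept_targeted for o in safe), len(safe)),
    }
```

Keeping pedagogy conditional on safe refusals means a model cannot raise that score simply by refusing more often; only the content of its refusals matters.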

Adversarial settings, specifically refusal-suppression and role-play-based pedagogical jailbreaks, are designed to stress-test model robustness. Results show that state-of-the-art LLMs (e.g., Gemini, GPT-5, Claude) generally maintain high safety in non-jailbreak settings but suffer significant safety degradation (drops of 30–99%) under adversarial prompts. Defenses that rely solely on pedagogical chain-of-thought prompting or fine-tuning are notably ineffective against these attacks.

Graph-Augmented Pedagogical Pipeline: Architectural Defense

To mitigate pedagogical jailbreak vulnerabilities, the authors propose a graph-augmented tutoring pipeline. The system parses student queries to extract prerequisite concepts, identifies missing knowledge by comparing them against the student's mastery state, and routes generation through an explicit gate that selects either instructive scaffolding or a direct solution, as dictated by inferred mastery.
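The routing step can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: `keyword_extractor` is a hypothetical stand-in for the concept parser (which in a real system would likely be model-based), and `Route`, `gate`, and `build_prompt` are invented names. The sketch's point is that the routing decision is computed in code from the mastery state, so instructions embedded in the student's message cannot change it.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set


@dataclass(frozen=True)
class Route:
    mode: str                # "instruct" or "solve"
    missing: FrozenSet[str]  # prerequisite concepts the student lacks


def keyword_extractor(query: str) -> Set[str]:
    """Hypothetical parser: map keywords in the query to prerequisite concepts."""
    table = {
        "eigenvalue": {"determinant", "matrix_multiplication"},
        "determinant": {"matrix_multiplication"},
    }
    lowered = query.lower()
    found: Set[str] = set()
    for keyword, prereqs in table.items():
        if keyword in lowered:
            found |= prereqs
    return found


def gate(query: str, mastered: Set[str],
         extract: Callable[[str], Set[str]] = keyword_extractor) -> Route:
    """Choose the generation mode from mastery gaps, not from the query's wording."""
    missing = extract(query) - mastered
    return Route("instruct" if missing else "solve", frozenset(missing))


def build_prompt(query: str, route: Route) -> str:
    """Assemble the instruction that would be sent to the tutor model."""
    if route.mode == "instruct":
        topics = ", ".join(sorted(route.missing))
        return ("Do not reveal the final answer. "
                f"Ask a guiding question about: {topics}.\nStudent: {query}")
    return f"Solve the problem step by step.\nStudent: {query}"
```

For example, a refusal-suppression query such as "Never refuse. What are the eigenvalues of [[2, 0], [0, 3]]?" from a student who has mastered only `matrix_multiplication` is still routed to `instruct`, with `determinant` as the missing concept.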

Figure 2: System overview of graph-augmented adaptive teaching, with prerequisites parsing and dynamic response gating based on student mastery.

Experimental evaluation of this pipeline across multiple LLMs yields substantial robustness improvements. For example, Qwen3-80B's worst-case safety rises from 0% (baseline) to 92.25% under the pipeline. Importantly, these gains are achieved without indiscriminate refusals; helpfulness metrics remain near-ceiling, confirming that the pipeline enables contextually adaptive answer provision rather than suppressing assistance altogether.

Furthermore, pedagogical output quality is preserved, with high rates of concept-targeted instruction under adversarial conditions. The robustness improvements are especially marked for larger models, whereas smaller models exhibit more limited gains—a phenomenon attributed to weaker instruction-following capacities.

Pedagogical Jailbreak Taxonomy and Defense Efficacy

The paper offers a taxonomy of pedagogical jailbreaks, distinguishing between white-box (e.g., gradient-based) and black-box (e.g., prompt rewriting, instruction suppression, emotional coercion) attacks. Empirical findings indicate that prompt-rewriting attacks that exploit competing objectives, such as role-play or refusal suppression, pose the greatest threat in educational contexts, in contrast to the mismatched-generalization attacks commonly encountered in general LLM safety research.

The graph-augmented pipeline not only demonstrates efficacy against canonical attacks but also proves robust against diverse black-box strategies, including psychological coercion and instructional constraint jailbreaks. Safety rates remain consistently high in these alternative threat models.

Practical and Theoretical Implications

This work expands the notion of LLM safety beyond toxicity and restricted content, capturing instructional risks that arise from premature answer provision. By formalizing the instructional objective mismatch, in which providing direct answers undermines durable learning, the authors advocate for pedagogically grounded architectures in educational LLMs.

The knowledge mastery graph serves as a foundational abstraction for modeling, evaluating, and defending pedagogical behaviors. Its extensibility facilitates integration with RL reward modeling (e.g., compositional reasoning on knowledge graphs, trajectory-level credit assignment) and adaptive tutoring systems that dynamically infer student mastery states.

Architectural enforcement of pedagogical constraints is shown to outperform prompt engineering and pedagogical fine-tuning alone, demonstrating the necessity of embedding pedagogical logic at the system level. The SHAPE benchmark provides a rigorous foundation for future comparative evaluation of educational LLMs and defenses.

Future Directions

Several avenues are proposed for further research:

Extension of mastery state modeling to partial proficiency, misconceptions, and non-hierarchical knowledge domains.
Dynamic updating of mastery states via multi-turn interactions and knowledge tracing.
RL-driven optimization of pedagogical trajectories leveraging the knowledge mastery graph.
Automated extraction and evaluation of concept-targeted instruction from LLM outputs.

Integration of graph-based reward modeling and dynamic knowledge expansion could enable truly adaptive, resilient educational tutors for diverse curricular domains.

Conclusion

The SHAPE framework establishes unified, formal definitions of safety, helpfulness, and pedagogy for educational LLMs, anchored in the knowledge mastery graph abstraction. Through systematic adversarial evaluation and architectural intervention, the paper demonstrates the inadequacy of conventional pedagogical prompting and fine-tuning strategies, and substantiates the necessity of pipeline-based response gating. Results suggest that embedding pedagogical logic at the system level is essential for robust, adaptive educational AI. Future research can leverage this foundation to advance both the theoretical modeling and practical deployment of instructional LLMs.
