Next Token Knowledge Tracing (NTKT)
- NTKT is a framework that models student performance as a next-token generation problem, reframing traditional knowledge tracing with autoregressive techniques.
- It leverages GPT-style causal transformers and LLMs augmented with LoRA, processing structured question-response sequences and rich text prompts for improved prediction.
- Empirical evaluations show NTKT achieves higher F1 and AUC and greater robustness in cold-start scenarios than conventional KT models.
Next Token Knowledge Tracing (NTKT) is a framework established for modeling student performance prediction as a next-token generation problem, aligning knowledge tracing with advances in large-scale autoregressive modeling. Recent work provides two distinct implementations: one employing GPT-style causal transformers on structured question–response tokens (Zhou et al., 17 Feb 2025), and one leveraging pretrained LLMs to integrate rich question text and interaction histories (Norris et al., 4 Nov 2025). Both approaches aim to advance predictive accuracy, cold-start robustness, and interpretability in educational AI systems.
1. Foundational Principles and Motivation
Traditional Knowledge Tracing (KT) methods, such as Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT), frame the student modeling problem as one of sequential state estimation, predicting future correctness based on interaction histories. Conventional systems typically use question identifiers and metadata (e.g., skill tags, timestamps) while omitting the context-rich natural language of the questions themselves. This ID-based scheme is constrained by the need for extensive manual labeling and suffers in cold-start scenarios when unseen items lack learned embeddings (Norris et al., 4 Nov 2025).
NTKT reframes KT as an autoregressive next-token modeling problem, analogous to causal language modeling. In NTKT, the student interaction sequence is treated as a generative process, and the system predicts the next response or outcome ("Correct"/"Incorrect") as the subsequent token given the entire preceding context. This perspective addresses both representation and generalization issues by enabling flexible input forms, ranging from structured alternations of question and response tokens (Zhou et al., 17 Feb 2025) to full text prompts exploiting the semantic knowledge encoded in LLMs (Norris et al., 4 Nov 2025).
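As a concrete illustration of this reframing, the minimal sketch below casts a short interaction history as a flat token stream whose next token is the outcome to be predicted; the token names and layout are hypothetical, not the exact vocabularies of either implementation.

```python
# Minimal sketch: recasting a knowledge-tracing history as next-token prediction.
# Token names and layout are illustrative, not either paper's exact vocabulary.
history = [
    ("Q17", "Correct"),
    ("Q42", "Incorrect"),
    ("Q08", "Correct"),
]
target_question = "Q42"

# Flatten the history into an alternating question/response token stream and
# append the target question; the model's task is to generate the next token
# ("Correct" or "Incorrect") given this prefix.
tokens = []
for qid, outcome in history:
    tokens.extend([qid, outcome])
tokens.append(target_question)

print(tokens)
# ['Q17', 'Correct', 'Q42', 'Incorrect', 'Q08', 'Correct', 'Q42']
# An autoregressive model trained on such sequences estimates
# P(next token = "Correct" | tokens).
```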
2. Sequence Representation and Tokenization
AAKT (Zhou et al., 17 Feb 2025) organizes each student’s educational record into a length-$2T$ sequence interleaving question and response tokens:

$$x = (q_1, r_1, q_2, r_2, \dots, q_T, r_T).$$

The joint probability is factorized autoregressively:

$$P(x_1, x_2, \dots, x_{2T}) = \prod_{t=1}^{2T} P\!\left(x_t \mid x_1, \dots, x_{t-1}\right).$$
Question tokens are embedded via lookup tables and enriched with skill information via auxiliary classification, while response tokens concatenate correctness-conditional embeddings and continuous posterior features (e.g., response time). This structured alternation captures both anterior (before-response) and posterior (after-response) states.
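A minimal PyTorch sketch of this alternating representation is given below; the module names, dimensions, and the way response time is appended are assumptions for illustration, not the exact AAKT implementation.

```python
import torch
import torch.nn as nn

class AlternatingEmbedder(nn.Module):
    """Embed a length-2T sequence of interleaved question/response tokens.
    Illustrative sketch: dimensions and feature handling are assumptions."""

    def __init__(self, num_questions: int, d_model: int = 128):
        super().__init__()
        self.q_embed = nn.Embedding(num_questions, d_model)  # question lookup table
        self.r_embed = nn.Embedding(2, d_model - 1)           # correctness-conditional embedding
        # One dimension is reserved for a continuous posterior feature
        # such as normalized response time.

    def forward(self, questions, correctness, response_time):
        # questions:     (B, T) long  - question ids
        # correctness:   (B, T) long  - 0/1 responses
        # response_time: (B, T) float - continuous posterior feature
        q = self.q_embed(questions)                                   # (B, T, d)
        r = torch.cat([self.r_embed(correctness),
                       response_time.unsqueeze(-1)], dim=-1)          # (B, T, d)
        B, T, d = q.shape
        return torch.stack([q, r], dim=2).reshape(B, 2 * T, d)        # q1, r1, q2, r2, ...

# Example: 4 students, 10 interactions each -> (4, 20, 128) interleaved sequence.
emb = AlternatingEmbedder(num_questions=1000)
x = emb(torch.randint(0, 1000, (4, 10)),
        torch.randint(0, 2, (4, 10)),
        torch.rand(4, 10))
print(x.shape)
```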
NTKT with LLMs (Norris et al., 4 Nov 2025) encodes each student interaction as a long text sequence comprising historical questions and outcomes, skill tags, IDs, time stamps, and a target question wrapped in XML-style tags:
```xml
<history>
  <Q> ... question text ... </Q>
  <options> ... </options>
  <QID>#ID</QID>
  <C>Concept</C>
  <cr>Correct/Incorrect</cr>
  ...
</history>
<target> ... </target>
```
Tokenization employs byte-pair encoding matching the backbone LLM (e.g., LLaMA-3B), with special tags entered as unique tokens and metadata held as text. Sequences may reach lengths up to 15,000 tokens.
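A sketch of how such a prompt could be assembled is shown below; the tag set follows the example above, while the helper name and record fields are hypothetical.

```python
def build_ntkt_prompt(history, target):
    """Assemble an XML-style NTKT prompt. Hypothetical helper: the tag set
    follows the example above; the record field names are assumptions."""
    parts = ["<history>"]
    for item in history:
        parts.append(f"<Q> {item['question']} </Q>")
        parts.append(f"<options> {item['options']} </options>")
        parts.append(f"<QID>#{item['qid']}</QID>")
        parts.append(f"<C>{item['concept']}</C>")
        parts.append(f"<cr>{'Correct' if item['correct'] else 'Incorrect'}</cr>")
    parts.append("</history>")
    parts.append("<target>")
    parts.append(f"<Q> {target['question']} </Q>")
    parts.append(f"<options> {target['options']} </options>")
    parts.append(f"<QID>#{target['qid']}</QID>")
    parts.append(f"<C>{target['concept']}</C>")
    parts.append("</target>")
    return "\n".join(parts)

# The fine-tuned LLM then generates the token following the target block,
# which resolves to "Correct" or "Incorrect".
```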
3. Model Architectures
AAKT utilizes a causal Transformer encoder resembling GPT-J: stacked masked self-attention blocks with rotary position embeddings. Input tokens are embedded according to their type (question or response), and causal masking ensures that the prediction at step $t$ is conditioned only on tokens $x_1, \dots, x_{t-1}$.
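The causal-masking behaviour can be sketched as follows; the block count, use of PyTorch's built-in encoder layers, and output head are simplifying assumptions, and rotary position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_blocks = 128, 4, 2    # illustrative sizes

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=n_blocks)
head = nn.Linear(d_model, 1)              # projects hidden states to response logits

x = torch.randn(4, 20, d_model)           # e.g. output of an alternating embedder
L = x.size(1)
# Upper-triangular additive mask: position t may attend only to positions <= t,
# so the prediction emitted at step t never sees future tokens.
causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
h = backbone(x, mask=causal_mask)
logits = head(h)                          # (4, 20, 1)
```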
NTKT (Norris et al., 4 Nov 2025) uses a decoder-only LLM backbone (e.g., LLaMA-3B or LLaMA-8B). Model weights are frozen; Low-Rank Adaptation (LoRA) modules are introduced in every attention and MLP block, producing learnable updates per layer. No additional classifier heads are employed—the pre-existing language modeling head predicts the next token directly, and at outcome positions, this resolves as “Correct” or “Incorrect.”
Architectural parameters for LLaMA-3B in NTKT:
| Parameter | Value |
|---|---|
| Layers | 32 transformer decoders |
| Hidden dimension | 4096 |
| Attention heads | 32 |
| Vocabulary | ~32K BPE tokens |
| LoRA rank | $16$ |
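A sketch of how such a configuration might be expressed with the Hugging Face peft library follows; the checkpoint name, target modules, dropout, and scaling factor are assumptions, with only the rank taken from the table above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.2-3B"        # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Freeze the backbone and attach LoRA adapters to the attention and MLP projections.
lora_cfg = LoraConfig(
    r=16,                                   # rank from the table above
    lora_alpha=32,                          # assumed scaling factor
    lora_dropout=0.05,                      # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # only the LoRA parameters are trainable
```

No separate classification head is added; the language-modeling head already present in the backbone scores the “Correct” and “Incorrect” tokens directly.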
4. Training Objectives and Procedures
AAKT (Zhou et al., 17 Feb 2025) involves a dual-objective loss:
- The primary loss covers next-response prediction (even positions), formulated as binary cross-entropy using logits projected from hidden states.
- The auxiliary task ensures question representations encode skill information by minimizing KL divergence between the predicted and ground-truth skill distributions over embedded questions.
Combined:

$$\mathcal{L}_{\text{AAKT}} = \mathcal{L}_{\text{BCE}} + \lambda\,\mathcal{L}_{\text{KL}},$$

where $\lambda$ weights the auxiliary skill-classification term.
Sequences are windowed into overlapping segments of fixed length, augmenting the training data; during inference, prediction masks prevent repeated evaluation of identical responses.
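A compact sketch of this combined objective is given below; the function signature, the weight `lam`, and its default value are assumptions consistent with the description above.

```python
import torch.nn.functional as F

def aakt_loss(response_logits, targets, skill_logits, skill_dist, lam=0.1):
    """Combined AAKT-style objective (illustrative sketch; lam is an assumed weight).

    response_logits: (B, T)    logits for next-response correctness
    targets:         (B, T)    ground-truth correctness in {0, 1}
    skill_logits:    (B, T, S) skill logits predicted from question embeddings
    skill_dist:      (B, T, S) ground-truth skill distribution
    """
    # Primary task: binary cross-entropy over next-response predictions.
    l_bce = F.binary_cross_entropy_with_logits(response_logits, targets.float())
    # Auxiliary task: KL divergence between predicted and true skill distributions.
    l_kl = F.kl_div(F.log_softmax(skill_logits, dim=-1), skill_dist,
                    reduction="batchmean")
    return l_bce + lam * l_kl
```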
NTKT (Norris et al., 4 Nov 2025) applies a masked cross-entropy loss on the outcome tokens “Correct” and “Incorrect”:

$$\mathcal{L} = -\sum_{t} m_t \log P_\theta\!\left(x_t \mid x_{<t}\right),$$

where $m_t$ is $1$ when position $t$ corresponds to an outcome token and $0$ elsewhere. Only the LoRA modules are updated; the backbone weights remain frozen. The fine-tuning loop uses AdamW (8-bit variant), a cosine learning-rate schedule, small batch sizes, and optionally 4-bit quantization to fit in GPU memory. Early stopping is triggered when the validation loss stagnates.
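In practice the mask can be realized by setting the labels of all non-outcome positions to the cross-entropy ignore index, as in the sketch below; the function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def outcome_masked_loss(logits, input_ids, outcome_token_ids):
    """Cross-entropy restricted to outcome positions (illustrative sketch).

    logits:            (B, L, V) next-token logits from the LLM
    input_ids:         (B, L)    tokenized prompt containing outcome tokens
    outcome_token_ids: vocabulary ids of the "Correct"/"Incorrect" tokens
    """
    labels = input_ids.clone()
    is_outcome = torch.zeros_like(labels, dtype=torch.bool)
    for tid in outcome_token_ids:
        is_outcome |= labels == tid
    labels[~is_outcome] = -100                      # ignored by cross_entropy
    # Shift so that position t predicts token t + 1, as in causal LM training.
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1),
                           ignore_index=-100)
```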
5. Empirical Evaluation and Performance Benchmarking
AAKT (Zhou et al., 17 Feb 2025) is assessed on EdNet-KT1, ASSISTments (2009, 2017), and Junyi datasets using AUC, ACC, and RMSE. Across all benchmarks, AAKT attains superior performance:
| Dataset | AAKT AUC | Best Baseline AUC |
|---|---|---|
| EdNet-KT1 | 0.7827 | 0.7797 |
| ASSISTments2009 | 0.7357 | 0.7303 |
| ASSISTments2017 | 0.8018 | 0.7922 |
| Junyi | 0.8146 | 0.8126 |
Ablations show systematic degradation in AUC, ACC, and RMSE when overlapping windowing, skill or time features, or auxiliary skill supervision are removed, indicating that each component contributes to the performance gains.
NTKT (Norris et al., 4 Nov 2025) is evaluated on the Eedi dataset, yielding pronounced improvements over prior neural KT models:
| Model | F1 (%) | Accuracy (%) | ROC-AUC (%) |
|---|---|---|---|
| DKT (baseline) | ~77.5 | ~63.2 | ~73.2 |
| NTKT (LLaMA-3B) | 90.20 | 86.70 | 95.72 |
| NTKT (LLaMA-1B) | ~88.3 | — | ~93.4 |
Cold-start scenarios demonstrate NTKT’s robustness: F1 remains 0.843 for both seen and unseen questions, with no statistically significant drop, while baselines decline significantly. Ablations show that full question text yields the greatest gains, notably in AUC.
6. Analytical Studies and Interpretability
AAKT provides heat-map visualizations of self-attention patterns: model predictions concentrate on the antecedent question–response pairs with the greatest predictive utility. Position-wise AUC analysis shows that predictions are hardest early in a sequence and stabilize at later positions, supporting the rationale for overlapping windowed input sequences. UMAP projections reveal that auxiliary skill supervision induces question embeddings to cluster according to the underlying skill hierarchy.
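These analyses can be reproduced in outline with standard tooling, as in the sketch below; how the attention weights and question embeddings are extracted from a trained model is implementation-specific and assumed here.

```python
import matplotlib.pyplot as plt
import umap  # umap-learn

def plot_attention_heatmap(attn, path="attention.png"):
    """attn: (L, L) self-attention weights averaged over heads for one sequence."""
    plt.figure()
    plt.imshow(attn, cmap="viridis")
    plt.xlabel("Attended position")
    plt.ylabel("Query position")
    plt.colorbar()
    plt.savefig(path)

def plot_question_umap(q_emb, skill_ids, path="umap.png"):
    """q_emb: (N, d) learned question embeddings; skill_ids: (N,) skill labels."""
    coords = umap.UMAP(n_components=2).fit_transform(q_emb)
    plt.figure()
    plt.scatter(coords[:, 0], coords[:, 1], c=skill_ids, s=4, cmap="tab20")
    plt.savefig(path)
```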
NTKT’s ablation studies further reinforce the benefit of incorporating full question text, with performance increasing markedly compared to ID- and concept-only variants. Scale experiments suggest optimal results for the 3B model, with diminishing returns and potential overfitting when further increasing size, especially under quantization constraints.
7. Implementation, Reproducibility, and Practical Considerations
AAKT is implemented in PyTorch using a GPT-J base (2–4 blocks), the Adam optimizer, embedding dimensions of $64$–$256$, overlapping input windows, and 5-fold cross-validation. All hyperparameter choices are reported, with window lengths tuned per dataset.
NTKT uses open-source LLaMA models (1B/3B/8B) and frozen weights augmented via LoRA, trained for up to 20,000 steps with batch size 16 (via accumulation), maximum sequence length 15,000 tokens, and hardware comprising a single NVIDIA A100 80GB GPU per run. Full code and preprocessing scripts are released for reproducibility, ensuring domain practitioners can replicate findings.
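The reported setup could be approximated with the Hugging Face training stack roughly as follows; values not stated in the text (learning rate, accumulation split, precision) are assumptions.

```python
from transformers import TrainingArguments

# Rough approximation of the reported NTKT fine-tuning setup.
args = TrainingArguments(
    output_dir="ntkt-llama3b-lora",
    max_steps=20_000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,    # effective batch size 16 via accumulation
    learning_rate=2e-4,                # assumed
    lr_scheduler_type="cosine",        # cosine learning-rate schedule
    optim="adamw_bnb_8bit",            # 8-bit AdamW variant
    bf16=True,                         # assumed precision
)
# Early stopping on validation loss can be attached with
# transformers.EarlyStoppingCallback when constructing the Trainer.
```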
Overall, Next Token Knowledge Tracing advances KT by integrating autoregressive sequence modeling and pretrained LLM semantic representations, enabling accurate, generalizable predictions of student behaviour in adaptive educational environments (Zhou et al., 17 Feb 2025, Norris et al., 4 Nov 2025).