Next Token Knowledge Tracing (NTKT)

Updated 12 November 2025
  • NTKT is a framework that models student performance as a next-token generation problem, reframing traditional knowledge tracing with autoregressive techniques.
  • It leverages GPT-style causal transformers and LLMs augmented with LoRA, processing structured question-response sequences and rich text prompts for improved prediction.
  • Empirical evaluations show NTKT achieves higher F1 and AUC and greater cold-start robustness than conventional KT models.

Next Token Knowledge Tracing (NTKT) is a framework that casts student performance prediction as a next-token generation problem, aligning knowledge tracing with advances in large-scale autoregressive modeling. Recent work provides two distinct implementations: one employing GPT-style causal transformers on structured question–response tokens (Zhou et al., 17 Feb 2025), and one leveraging pretrained LLMs to integrate rich question text and interaction histories (Norris et al., 4 Nov 2025). Both approaches aim to advance predictive accuracy, cold-start robustness, and interpretability in educational AI systems.

1. Foundational Principles and Motivation

Traditional Knowledge Tracing (KT) methods, such as Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT), frame the student modeling problem as one of sequential state estimation, predicting future correctness based on interaction histories. Conventional systems typically use question identifiers and metadata (e.g., skill tags, timestamps) while omitting the context-rich natural language of the questions themselves. This ID-based scheme is constrained by the need for extensive manual labeling and suffers in cold-start scenarios when unseen items lack learned embeddings (Norris et al., 4 Nov 2025).

NTKT reframes KT as an autoregressive next-token modeling problem, analogous to causal language modeling. In NTKT, the student interaction sequence is treated as a generative process, and the system predicts the next response or outcome ("Correct"/"Incorrect") as the subsequent token given the entire preceding context. This perspective addresses both representation and generalization issues by enabling flexible input forms, ranging from structured alternations of question and response tokens (Zhou et al., 17 Feb 2025) to full text prompts exploiting the semantic knowledge encoded in LLMs (Norris et al., 4 Nov 2025).
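
As a rough illustration of this readout, the sketch below scores "Correct" versus "Incorrect" as the next token of a causal LM via Hugging Face transformers; the checkpoint path and prompt string are hypothetical placeholders, not artifacts from either paper.

```python
# Minimal sketch: compare the next-token probability of "Correct" vs. "Incorrect".
# The checkpoint path and prompt format are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "path/to/ntkt-finetuned-model"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "<history> ... prior questions and outcomes ... </history><target> ... </target>"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token

# Take the first sub-token of each outcome word as an approximation.
correct_id = tokenizer(" Correct", add_special_tokens=False).input_ids[0]
incorrect_id = tokenizer(" Incorrect", add_special_tokens=False).input_ids[0]
probs = torch.softmax(next_token_logits[[correct_id, incorrect_id]], dim=-1)
print(f"P(next attempt correct) ≈ {probs[0].item():.3f}")
```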

2. Sequence Representation and Tokenization

AAKT (Zhou et al., 17 Feb 2025) organizes each student’s educational record into a length-$2T$ sequence interleaving question and response tokens:

$x_1 = q_1,\ x_2 = r_1,\ \ldots,\ x_{2T-1} = q_T,\ x_{2T} = r_T$

Joint probability is factorized autoregressively:

$P(q_{1:T}, r_{1:T}) = \prod_{i=1}^{2T} P(x_i \mid x_{<i})$

Question tokens are embedded via lookup tables and enriched with skill information via auxiliary classification, while response tokens concatenate correctness-conditional embeddings and continuous posterior features (e.g., response time). This structured alternation captures both anterior (before-response) and posterior (after-response) states.
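
For concreteness, the interleaved layout can be produced with a few lines of Python; the interaction-record fields used here are assumed for the sketch and are not AAKT's released preprocessing schema.

```python
# Sketch: build the length-2T interleaved sequence x_1=q_1, x_2=r_1, ..., x_{2T-1}=q_T, x_{2T}=r_T.
# The dict fields ("question_id", "correct", "response_time") are assumptions for illustration.
def interleave(interactions):
    tokens = []
    for step in interactions:
        tokens.append(("Q", step["question_id"]))                     # anterior (before-response) token
        tokens.append(("R", step["correct"], step["response_time"]))  # posterior (after-response) token
    return tokens  # length 2T, alternating question and response tokens

history = [
    {"question_id": 101, "correct": 1, "response_time": 12.4},
    {"question_id": 57,  "correct": 0, "response_time": 33.1},
]
print(interleave(history))
# [('Q', 101), ('R', 1, 12.4), ('Q', 57), ('R', 0, 33.1)]
```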

NTKT with LLMs (Norris et al., 4 Nov 2025) encodes each student's interaction history as a long text sequence comprising historical questions and outcomes, skill tags, IDs, timestamps, and a target question, all wrapped in XML-style tags:

<history>
  <Q> ... question text ... </Q>
  <options> ... </options>
  <QID>#ID</QID>
  <C>Concept</C>
  <cr>Correct/Incorrect</cr>
  ...
</history>
<target> ... </target>

Tokenization employs byte-pair encoding matching the backbone LLM (e.g., LLaMA-3B), with the special tags added as unique tokens and all metadata kept as plain text. Sequences may reach lengths of up to 15,000 tokens.
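
A rough sketch of assembling such a prompt is shown below; the tag names mirror the snippet above, while the helper function and record fields are hypothetical.

```python
# Sketch: assemble an NTKT-style text prompt from an interaction history.
# Tag names follow the example above; record fields and the helper itself are hypothetical.
def build_prompt(history, target):
    parts = ["<history>"]
    for h in history:
        parts += [
            f"  <Q> {h['question_text']} </Q>",
            f"  <options> {h['options']} </options>",
            f"  <QID>#{h['question_id']}</QID>",
            f"  <C>{h['concept']}</C>",
            f"  <cr>{'Correct' if h['correct'] else 'Incorrect'}</cr>",
        ]
    parts.append("</history>")
    parts.append(f"<target> {target['question_text']} </target>")
    return "\n".join(parts)
```

The resulting string is tokenized with the backbone's BPE tokenizer before training or inference.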

3. Model Architectures

AAKT utilizes a causal Transformer resembling GPT-J: stacked masked self-attention blocks with rotary position embeddings. Input tokens are embedded according to their type (question or response), and causal masking ensures that the prediction at step $i$ is conditioned only on tokens $x_{\leq i}$.

NTKT (Norris et al., 4 Nov 2025) uses a decoder-only LLM backbone (e.g., LLaMA-3B or LLaMA-8B). Model weights $\Phi_0$ are frozen; Low-Rank Adaptation (LoRA) modules are introduced in every attention and MLP block, producing learnable updates $\Delta\Phi(\Theta)$ per layer. No additional classifier heads are employed: the pre-existing language modeling head predicts the next token directly, and at outcome positions this resolves to “Correct” or “Incorrect.”
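
A hedged sketch of this setup with the Hugging Face peft library follows; the checkpoint name, target module list, and LoRA hyperparameters other than the rank are illustrative and may differ from the paper's configuration.

```python
# Sketch: freeze a decoder-only backbone and attach LoRA adapters to attention and MLP projections.
# Module names correspond to LLaMA-style architectures; alpha/dropout values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")  # backbone Phi_0 (example checkpoint)
lora_cfg = LoraConfig(
    r=16,                                   # LoRA rank, as reported for NTKT
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                    "gate_proj", "up_proj", "down_proj"],     # MLP projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)  # only the LoRA parameters Theta remain trainable
model.print_trainable_parameters()
```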

Architectural parameters for LLaMA-3B in NTKT:

| Parameter | Value |
| --- | --- |
| Layers | 32 transformer decoder blocks |
| Hidden dimension | 4096 |
| Attention heads | 32 |
| Vocabulary | ~32K BPE tokens |
| LoRA rank $r$ | 16 |

4. Training Objectives and Procedures

AAKT (Zhou et al., 17 Feb 2025) involves a dual-objective loss:

  1. The primary loss $\mathcal{L}_{\text{pred}}$ covers next-response prediction (even positions), formulated as binary cross-entropy using logits $(v_{\text{correct}}, v_{\text{incorrect}})$ projected from the hidden states.
  2. The auxiliary loss $\mathcal{L}_{\text{aux}}$ ensures question representations encode skill information by minimizing the KL divergence between the predicted and ground-truth skill distributions over embedded questions.

Combined:

$\mathcal{L} = \mathcal{L}_{\text{pred}} + \mathcal{L}_{\text{aux}}$
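
A compact PyTorch sketch of this combined objective is given below; the tensor shapes and argument names are assumptions for illustration, not the released AAKT implementation.

```python
# Sketch: AAKT-style dual objective = cross-entropy on response correctness + KL on skill distributions.
# Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def aakt_loss(response_logits, correct_labels, skill_log_probs, skill_targets):
    # response_logits: (B, T, 2) logits (v_correct, v_incorrect) at response positions
    # correct_labels:  (B, T)    1 if the response was correct, else 0
    # skill_log_probs: (B, T, S) predicted log-distribution over S skills per question
    # skill_targets:   (B, T, S) ground-truth skill distribution (e.g., one-hot skill tags)
    l_pred = F.cross_entropy(
        response_logits.reshape(-1, 2),
        (1 - correct_labels).long().reshape(-1),  # class 0 = correct, class 1 = incorrect
    )
    l_aux = F.kl_div(skill_log_probs, skill_targets, reduction="batchmean")
    return l_pred + l_aux
```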

Sequences are windowed into overlapping segments (length $L_{\text{max}}$), augmenting the training data; during inference, prediction masks prevent repeated evaluation on identical responses.
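
The overlapping segmentation can be sketched as a simple strided slice over the interleaved sequence; the window length and stride below are illustrative hyperparameters.

```python
# Sketch: split a long interleaved sequence into overlapping training windows.
# l_max and stride are illustrative; the paper tunes window lengths per dataset.
def overlapping_windows(tokens, l_max=200, stride=100):
    windows = []
    for start in range(0, max(1, len(tokens) - l_max + 1), stride):
        windows.append(tokens[start:start + l_max])
    return windows
```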

NTKT (Norris et al., 4 Nov 2025) applies masked cross-entropy loss on the outcome “Correct” or “Incorrect” tokens:

$\mathcal{L}(\Theta) = -\sum_{i=1}^{N}\sum_{c=1}^{|x_i|} m_{i,c}\,\log P\left(x_{i,c}\mid x_{i,<c};\;\Phi_0+\Delta\Phi(\Theta)\right)$

where $m_{i,c}$ is $1$ when $x_{i,c}\in\{\text{“Correct”},\text{“Incorrect”}\}$ and $0$ elsewhere. LoRA modules update only $\Theta$, with the backbone weights $\Phi_0$ held frozen. The fine-tuning loop uses AdamW (8-bit variant), a cosine learning-rate schedule, small batch sizes, and optionally 4-bit quantization to fit in GPU memory. Early stopping is based on stagnation of the validation loss.
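
One way to realize this mask with a standard causal-LM loss is to set every non-outcome label to the ignore index; the sketch below assumes the Hugging Face convention (ignore_index = -100) and that the outcome words map to single tokens, which may not hold for every tokenizer.

```python
# Sketch: mask LM labels so cross-entropy is computed only at outcome-token positions.
# Assumes ignore_index=-100 and single-token outcome words (assumptions, not guaranteed).
import torch

def mask_labels(input_ids, outcome_token_ids):
    labels = input_ids.clone()
    is_outcome = torch.zeros_like(labels, dtype=torch.bool)
    for tok in outcome_token_ids:
        is_outcome |= labels == tok
    labels[~is_outcome] = -100  # ignored by the loss, i.e. m_{i,c} = 0
    return labels
```

Passing these labels to a causal-LM cross-entropy reproduces the masked objective above, since ignored positions contribute nothing to the sum.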

5. Empirical Evaluation and Performance Benchmarking

AAKT (Zhou et al., 17 Feb 2025) is assessed on EdNet-KT1, ASSISTments (2009, 2017), and Junyi datasets using AUC, ACC, and RMSE. Across all benchmarks, AAKT attains superior performance:

| Dataset | AAKT AUC | Best Baseline AUC |
| --- | --- | --- |
| EdNet-KT1 | 0.7827 | 0.7797 |
| ASSISTments2009 | 0.7357 | 0.7303 |
| ASSISTments2017 | 0.8018 | 0.7922 |
| Junyi | 0.8146 | 0.8126 |

Ablations indicate systematic degradation in AUC, ACC, and RMSE when overlapping windowing, skill or time features, or auxiliary skill supervision are removed, indicating that each component contributes to the performance gains.

NTKT (Norris et al., 4 Nov 2025) is evaluated on the Eedi dataset, yielding pronounced improvements over prior neural KT models:

| Model | F1 (%) | Accuracy (%) | ROC-AUC (%) |
| --- | --- | --- | --- |
| DKT (baseline) | ~77.5 | ~63.2 | ~73.2 |
| NTKT (LLaMA-3B) | 90.20 | 86.70 | 95.72 |
| NTKT (LLaMA-1B) | ~88.3 | | ~93.4 |

Cold-start scenarios demonstrate NTKT's robustness: F1 remains $\sim 0.843$ for both seen and unseen questions (no significant drop, $p>0.5$), while baselines decline significantly ($p<0.001$). Ablations show that full question text yields the greatest gains, notably in AUC calibration.

6. Analytical Studies and Interpretability

AAKT provides heat-map visualizations of its self-attention patterns: predictions concentrate attention on the antecedent question–response pairs with the greatest predictive utility. Position-wise AUC analysis shows that early positions in a sequence are harder to predict, with performance stabilizing at later positions, supporting the rationale for overlapping windowed input sequences. UMAP projections reveal that auxiliary skill supervision induces question embeddings to cluster according to underlying skill hierarchies.
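
The embedding-clustering analysis can be approximated with a standard UMAP projection; the sketch below assumes an exported matrix of learned question embeddings and per-question skill IDs (hypothetical file names) and uses the umap-learn package.

```python
# Sketch: project learned question embeddings to 2D with UMAP and color points by skill tag.
# The .npy files are hypothetical exports from a trained model.
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

question_embeddings = np.load("question_embeddings.npy")  # shape (N, d)
skill_ids = np.load("skill_ids.npy")                      # shape (N,)

coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(question_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=skill_ids, s=4, cmap="tab20")
plt.title("Question embeddings colored by skill")
plt.show()
```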

NTKT’s ablation studies further reinforce the benefit of incorporating full question text, with performance increasing markedly compared to ID- and concept-only variants. Scale experiments suggest optimal results for the 3B model, with diminishing returns and potential overfitting when further increasing size, especially under quantization constraints.

7. Implementation, Reproducibility, and Practical Considerations

AAKT is implemented in PyTorch using a GPT-J base (2–4 blocks), the Adam optimizer (learning rate $10^{-3}$), embedding dimensions of $64$–$256$, overlapping input windows, and 5-fold cross-validation. All hyperparameter choices are reported, with window lengths tuned per dataset.

NTKT uses open-source LLaMA models (1B/3B/8B) and frozen weights augmented via LoRA, trained for up to 20,000 steps with batch size 16 (via accumulation), maximum sequence length 15,000 tokens, and hardware comprising a single NVIDIA A100 80GB GPU per run. Full code and preprocessing scripts are released for reproducibility, ensuring domain practitioners can replicate findings.
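
For orientation, the reported budget could be expressed as a Hugging Face TrainingArguments configuration roughly like the following; values stated in the text are carried over, while the learning rate and logging settings are placeholders.

```python
# Sketch: training configuration mirroring the reported NTKT setup.
# Learning rate, logging, and saving cadence are placeholders, not reported values.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ntkt-lora",
    max_steps=20_000,               # up to 20,000 steps
    per_device_train_batch_size=1,  # long sequences (up to ~15,000 tokens) on one A100 80GB
    gradient_accumulation_steps=16, # effective batch size 16 via accumulation
    learning_rate=2e-4,             # placeholder
    lr_scheduler_type="cosine",     # cosine learning-rate schedule
    optim="adamw_bnb_8bit",         # 8-bit AdamW variant (requires bitsandbytes)
    bf16=True,
    logging_steps=50,
    save_steps=1_000,
)
```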

Overall, Next Token Knowledge Tracing advances KT by integrating autoregressive sequence modeling and pretrained LLM semantic representations, enabling accurate, generalizable predictions of student behaviour in adaptive educational environments (Zhou et al., 17 Feb 2025, Norris et al., 4 Nov 2025).
