Vocabulary Assistant LLM (GPT-4o)
- Vocabulary Assistant LLM (GPT-4o) is a data-driven system for vocabulary acquisition that uses a ‘Learning by Teaching’ approach to generate contextual errors and adaptive questions.
- It integrates five modular components—including a Materials Generator, Question Generator, and Learner Model—to personalize vocabulary learning based on real-time performance metrics.
- Empirical evaluations demonstrate a 5-15 percentage point improvement in retention rates, validating its effectiveness over static flashcard methods.
A Vocabulary Assistant LLM based on GPT-4o implements a data-driven “Learning by Teaching” (LbT) approach for vocabulary acquisition. This system utilizes GPT-4o to dynamically generate pedagogically effective errors, context-sensitive learner questioning, and tailored tutoring feedback, all orchestrated in a scalable pipeline. Empirical evidence supports measurable improvements in memory retention and shows that learner-specific adaptation is important for maximizing learning outcomes (Uchida et al., 20 Apr 2026).
1. System Architecture and Workflow
The core architecture comprises five integrated modules:
- Materials Generator (MG): A GPT-4o prompt ingests a target word or idiom and produces a plausible sentence with a misuse, an error explanation, and a list of possible corrections.
- Question Generator (QG): A GPT-4o prompt, acting as a “student,” asks context-sensitive questions derived from the misuse sentence, the learner’s previous explanation, and interaction history.
- Learner Model (LM): Maintains a per-word proficiency parameter () and a global cognitive load estimate (). Parameters are updated after each learner response.
- Feedback Engine (FE): Delivers targeted hints, explanations, and gradually increased scaffolding when prompted by learner errors or explicit requests.
- Data Store / Analytics (DS): Records all interactions (including word, question, answer, correctness, and input length), supporting post hoc analytics and algorithmic adaptation.
The data flow is as follows:
- Pretesting uses MCQs to identify unfamiliar words ().
- For each , the MG produces a misuse sentence , supporting evidence, and correction set.
- The learner reads and submits an explanation .
- The QG, conditioned on , supporting evidence, , recent question history, and learner profile, generates a follow-up question .
- The learner responds; FE provides hints or confirmation, while DS logs 0.
- LM updates 1 and 2 according to observed performance.
- Steps 4–6 iterate until the module’s stopping criterion is met (e.g., five corrections or three minutes elapsed per word).
- Retention is assessed immediately and at intervals (3 and 4 days).
Selection of training batches is governed by a scoring function:
5
where 6 are tunable, 7 is proficiency, and 8 time since last review. The top 9 words by score are selected per review batch.
2. Dynamic Question Generation and Adaptive Difficulty
Question generation leverages both prompt engineering and live adaptation:
- Prompt Templating: System messages define the “student” persona, target vocabulary, and correction roles. User messages inject 0, supporting evidence, 1, and the last 2 questions, minimizing repetition.
- Sampling Parameters: Sampling diversity is maintained with temperature 3 and top-p 4, adjusted for output variability and control.
- Difficulty Adaptation: The QG prompt reflects learner proficiency through a dynamic target difficulty:
5
As 6, the questions transition from basic definitions to novel contextual usage.
- Redundancy Control: Recent question history is embedded in the prompt to explicitly prevent repeated questioning.
This method addresses major limitations of prior work, in which question-generation relied on static templates and incurred significant development costs (Uchida et al., 20 Apr 2026).
3. Experimental Protocol and Evaluation Metrics
Pilot evaluation employed a within-subject, cross-over design with 7 university student participants. Each participant completed both the LbT-GPT-4o condition and a static flashcard (multiple-choice) baseline. The study sequence:
- Pretest: 30 Eiken-level vocabulary MCQs to determine 8.
- Posttest-1: Immediately follows learning of 10 assigned words (9).
- Posttest-2: Three days later, assessment for the next decile of words.
- Posttest-3: Seven days after initial exposure, for the final 10.
Retention rate per word at time 0 is calculated:
1
Mean retention rate:
2
Learning gain is immediate posttest minus pretest:
3
Statistical significance is evaluated by paired two-tailed t-tests on per-participant 4; 5 indicates significance.
Key Empirical Findings
- Immediate Posttest-1: LbT confers a modest 6 percentage point advantage.
- Posttest-2 (3 days): Retention rate 7 percentage points higher under LbT.
- Posttest-3 (7 days): 8 percentage point retention advantage versus baseline.
4. Personalization and Learner Modeling
Personalization derives from both real-time interaction logs and learner-reflective self-reports.
Traits Associated with Learning Outcomes
- Active Engagement: Measured as high average input length; positively correlated with 9 (LbT outperformance).
- Cognitive Load Sensitivity: Participants with lower 0 (lower average time per correction) sustain greater retention.
- Metacognitive Awareness & Intrinsic Motivation: Higher self-rated “efficiency” and “motivation” (survey Q3/Q4) predict increased learning gain.
Adaptive Strategies
| Challenge Detected | System Adaptation | Metric/Trigger |
|---|---|---|
| Low engagement (input 1 2) | Simpler, multiple-choice questions | Avg. words per response |
| High cognitive load (3 high) | Reduced question frequency, scaffolding “hint mode” | Response time or cognitive load 4 |
| Reported high effort | Increase review interval, lower question difficulty | Self-report (1–5 scale after each block) |
The system uses a lightweight Bayesian update for word proficiency:
5
with learning rate 6, for 7.
5. Module Specifications and Prompt-Engineering Recommendations
Question Generator: Issues GPT-4o API calls with dynamically programmed persona and dialogue context, maintaining a context window of 8 turns.
Learner Model: Implements continuous 9 and 0 estimation, with Bayesian updating for each word and session.
Feedback Engine:
- Upon hint request, GPT-4o is prompted for scaffolded clues tailored to the actual learner answer and the corresponding misuse sentence.
- Repeated errors on a word trigger insertion of additional usage examples in subsequent reviews.
Prompt Engineering:
- Incorporate few-shot examples spanning question difficulty.
- Dynamically adjust temperature and top-p (e.g., begin with 1 for diverse output; decrease to 2 where 3).
- Explicit redundancy prevention (“do not repeat previous questions”), with recent question history passed to GPT-4o context.
Cost is managed by limiting corrections (e.g., 4 per word) and batching API calls.
6. Impact, Scalability, and Empirical Outcomes
Deployment is entirely based on cloud-accessed GPT-4o prompts without the need for manually authored templates, facilitating scalable and cost-effective operation. The pilot study anticipates an average 5–6 percentage point improvement in 3-day and 7-day retention rates per learner. All system modules are orchestrated via API and interaction data is centrally logged, supporting both real-time personalization and large-scale analytics.
The architecture thus operationalizes the LbT paradigm for vocabulary acquisition, overcoming legacy limitations in question diversity, development cost, and adaptability, and yielding quantitatively validated gains in retention and learner engagement (Uchida et al., 20 Apr 2026).