Learner's Dictionary Definition Generation (LDDG)
- LDDG is a method for generating dictionary definitions that use only a predefined, limited vocabulary, ensuring lexical simplicity.
- It plays a key role in scalable learner’s lexicography and enhances evaluation of language models in resource-limited educational contexts.
- Modern LLMs combined with iterative simplification techniques such as IterSim achieve high truthfulness, coverage, and guideline compliance, although sense specificity remains the most challenging criterion.
Learner’s Dictionary Definition Generation (LDDG) is the task of automatically producing dictionary definitions for headwords using only a constrained, pedagogy-oriented “defining vocabulary.” LDDG is distinguished by its strict lexical simplicity constraint: outputs must be both semantically accurate and entirely composed of words from a predefined, limited list, such as the most frequent 16,000 tokens in a target language. LDDG is central to scalable learner’s lexicography, pedagogical resource creation, and the interpretability and evaluation of LLMs in educational or resource-limited domains.
1. Formal Task Definition and Operational Constraints
LDDG is a special case of dictionary definition generation (DDG), mapping a headword $w$ (with part-of-speech $p$ and, optionally, reading $r$) to a set of non-contextualized, human-readable definitions $D = \{d_1, \dots, d_n\}$ that collectively cover the major senses of $w$. Each $d_i$ must satisfy the defining vocabulary constraint $\mathrm{tokens}(d_i) \subseteq V$, where $V$ is a fixed, curated word list (e.g., TUBE16K for Japanese). Thus, the mapping enforced is:

$(w, p, r) \mapsto D = \{d_1, \dots, d_n\}$ such that $\mathrm{tokens}(d_i) \subseteq V$ for all $i$.
Unlike general DDG, where unrestricted lexical choice is permitted, LDDG requires explicit mechanisms—architectural, training, or post-hoc—to ensure the absence of out-of-vocabulary words in generated definitions (Ide et al., 5 Jan 2026).
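The vocabulary constraint itself is mechanically checkable. The following is a minimal sketch, assuming a simple whitespace tokenizer (a real Japanese pipeline targeting TUBE16K would require morphological analysis), of the compliance and OOV checks that downstream stages rely on; the helper names are illustrative, not from the paper.

```python
def tokens(definition: str) -> set[str]:
    # Placeholder tokenizer; a real Japanese setup would use morphological analysis.
    return set(definition.lower().split())

def is_v_compliant(definition: str, defining_vocab: set[str]) -> bool:
    """True iff every token of the definition belongs to the defining vocabulary V."""
    return tokens(definition) <= defining_vocab

def oov_tokens(definition: str, defining_vocab: set[str]) -> set[str]:
    """Tokens that violate the defining vocabulary constraint."""
    return tokens(definition) - defining_vocab
```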
2. Evaluation Protocols and Criteria: Beyond Surface Metrics
Standard automatic metrics such as BLEU and BERTScore, while frequently used for DDG evaluation, do not adequately measure the semantic and pedagogical qualities relevant for LDDG. To address this, LDDG evaluation employs a multitiered rubric:
- Truthfulness (T): Proportion of system-generated definitions corresponding to real senses in a reference set, i.e., a precision-style measure.
- Coverage (C): Recall-style criterion quantifying how many reference senses are realized in the system outputs.
- Sense Specificity (S): Adjusts for redundant or overlapping definitions; penalizes merges of system outputs covering the same sense.
- Guideline Compliance (G): Measures adherence to explicit style protocols (parentheses, usage labels, sentence type).
The use of LLMs as judges (LLM-as-judge; e.g., GPT-5.1) enables reliable scoring of these criteria. Human–machine agreement, measured with Kendall's $\tau$, substantially exceeds that of BLEU ($\tau = 0.174$), demonstrating robust alignment of LLM-based evaluation with expert annotators (Ide et al., 5 Jan 2026).
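As a concrete illustration of the precision/recall framing of T, C, and S, the sketch below assumes a hypothetical judge output that maps each system definition to the index of the reference sense it matches (or `None` if it matches no real sense); it is not the authors' scoring code.

```python
def truthfulness(matches: list[int | None]) -> float:
    """Precision-style: fraction of system definitions that match some real sense."""
    if not matches:
        return 0.0
    return sum(m is not None for m in matches) / len(matches)

def coverage(matches: list[int | None], n_reference_senses: int) -> float:
    """Recall-style: fraction of reference senses realized by at least one definition."""
    covered = {m for m in matches if m is not None}
    return len(covered) / n_reference_senses

def sense_specificity(matches: list[int | None]) -> float:
    """Penalize redundant definitions that cover the same reference sense."""
    matched = [m for m in matches if m is not None]
    if not matched:
        return 0.0
    return len(set(matched)) / len(matched)

# Example: 4 system definitions, 3 reference senses; two definitions hit sense 0.
matches = [0, 0, 2, None]
print(truthfulness(matches))        # 0.75
print(coverage(matches, 3))         # ~0.67
print(sense_specificity(matches))   # ~0.67
```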
3. Model Architectures and Enforcing Defining Vocabulary Constraints
Baseline LDDG systems are often built atop LLMs accessed through zero- or few-shot prompting. The central architectural challenge—enforcing strict lexical simplicity—necessitates post-generation filtering or iterative simplification.
Iterative Simplification Algorithm ("IterSim"):
Given initial definitions $d_1, \dots, d_n$ (from a general model, e.g., GPT-5.1 or Claude 4.5), IterSim identifies out-of-vocabulary (OOV) tokens and iteratively prompts the LLM to rewrite each definition, explicitly banning these terms and any newly introduced complex words. Each simplification step must reduce the count of OOV words, ensuring convergence to $V$-compliant outputs:
- Compute the OOV set $O_i = \mathrm{tokens}(d_i) \setminus V$ for each definition $d_i$.
- For each $d_i$ with $O_i \neq \emptyset$, rewrite the current definition, banning $O_i$ and any newly introduced complex words.
- Terminate when all definitions are $V$-compliant.
This process allows leveraging the semantic richness of unconstrained LLMs while guaranteeing the pedagogical constraint central to LDDG (Ide et al., 5 Jan 2026).
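A minimal Python sketch of this loop is given below; `llm_rewrite` is a hypothetical stand-in for the prompting step and the tokenizer is a placeholder, so the code illustrates the control flow rather than reproducing the authors' implementation.

```python
from typing import Callable

def _tokens(text: str) -> set[str]:
    # Placeholder tokenizer; a real Japanese pipeline (e.g., for TUBE16K)
    # would use morphological analysis instead of whitespace splitting.
    return set(text.lower().split())

def itersim(definition: str,
            defining_vocab: set[str],
            llm_rewrite: Callable[[str, set[str]], str],
            max_rounds: int = 10) -> str:
    """Iteratively simplify one definition until it is V-compliant."""
    banned: set[str] = set()
    current = definition
    for _ in range(max_rounds):
        oov = _tokens(current) - defining_vocab   # out-of-vocabulary tokens
        if not oov:
            return current                        # V-compliant: done
        banned |= oov                             # ban these terms in the next rewrite
        candidate = llm_rewrite(current, banned)  # prompt the LLM with the ban list
        # Accept the rewrite only if it strictly reduces the OOV count,
        # which is the condition that guarantees convergence.
        if len(_tokens(candidate) - defining_vocab) < len(oov):
            current = candidate
    return current  # may remain non-compliant if max_rounds is exhausted
```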
4. Dataset Construction and Benchmarking
A representative LDDG resource is the D3J dataset (Japanese), where each headword’s definitions were authored using only TUBE16K (top 16,000 tokens from YouTube subtitles), vetted by an expert lexicographer. D3J features 325 headwords covering 546 major senses, balanced across parts-of-speech and frequency bands. Wiktionary definitions cover the same headwords, but only 66.9% of Wiktionary outputs fall within TUBE16K, demonstrating the necessity of tailored LDDG curation.
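A rough sketch of how such vocabulary coverage could be measured, assuming the existing definitions are available as plain strings and using the same placeholder whitespace tokenizer as above (not the paper's tooling):

```python
def compliance_rate(definitions: list[str], defining_vocab: set[str]) -> float:
    """Share of definitions whose tokens all fall within the defining vocabulary."""
    if not definitions:
        return 0.0
    compliant = sum(set(d.lower().split()) <= defining_vocab for d in definitions)
    return compliant / len(definitions)
```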
Quantitative benchmarks on D3J show that IterSim applied to few-shot-prompted Claude 4.5 yields high average scores (out of 100) across all major criteria (T, C, S, G), with 99% of outputs TUBE16K-compliant, compared to Wiktionary's lower truthfulness and a specificity of $76$, confirming that post-hoc simplification substantially outperforms both unconstrained dictionary entries and baseline model generations (Ide et al., 5 Jan 2026).
5. Empirical Findings and Challenges
Key empirical findings for LDDG include:
- LLMs as LDDG Engines: Proprietary models (Claude Sonnet 4.5, GPT-5.1) robustly outperform open-weight alternatives (Qwen3-32B, Llama-3.3-Swallow-70B) by 20–30 points across criteria, reinforcing the necessity of frontier models for pedagogically adequate definitions.
- Prompting Regimes: Few-shot prompting consistently raises both overall LDDG scores and critical sub-criteria, notably specificity (+14 absolute for Claude 4.5).
- IterSim Effectiveness: Iterative simplification achieves near-perfect defining vocabulary compliance with negligible loss in semantic fidelity or coverage.
- Sense Specificity as Open Problem: Even with strong prompting and simplification, sense specificity (i.e., avoiding sense overlap in outputs) remains the most challenging LDDG criterion for both humans and LLMs; absolute levels plateau around 88 for best systems, with persistent semantic overlaps due to similar phrasing (Ide et al., 5 Jan 2026).
- Limits of Automatic Metrics: BLEU and BERTScore exhibit weak rank correlation with human or LLM-as-judge rubric scores (Kendall's $\tau$ of $0.174$ and $0.268$, respectively, versus a substantially higher value for the LLM judge), underscoring the need for explicit semantics-focused evaluation in LDDG; a correlation check of this kind is sketched below.
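This kind of metric validation can be reproduced in outline with a rank-correlation check; the sketch below uses made-up placeholder scores purely to illustrate the computation, not values from the paper.

```python
from scipy.stats import kendalltau

human_scores = [88, 72, 95, 60, 81]            # hypothetical expert rubric scores
bleu_scores  = [0.21, 0.34, 0.25, 0.18, 0.30]  # hypothetical BLEU scores
judge_scores = [85, 70, 93, 58, 80]            # hypothetical LLM-as-judge scores

tau_bleu, _ = kendalltau(human_scores, bleu_scores)
tau_judge, _ = kendalltau(human_scores, judge_scores)
print(f"BLEU tau={tau_bleu:.3f}  LLM-judge tau={tau_judge:.3f}")
```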
6. Methodological Limitations and Future Directions
Outstanding limitations for LDDG research include:
- Generalizability: Current frameworks and datasets (e.g., D3J) are language-specific; validation across typologically diverse languages and for more extreme vocabulary constraints remains an open area.
- Rubric Refinement: Inter-annotator agreement for truthfulness and specificity is only moderate, signaling ambiguity in both sense delineation and rubric interpretation; improved, operationalized evaluation protocols are needed.
- Dependence on Proprietary LLMs: Scoring and high-accuracy definition generation depend on non-public models, which restricts reproducibility and broad research access.
- Polysemy and Sense Disambiguation: More precise mechanisms for explicit sense separation (e.g., polysemy detection, contrastive or sense-aware architectures) are crucial for advancing specificity and coverage in LDDG.
- Integrating Context, Examples, and Usage: Next-generation LDDG systems should jointly generate usage examples, sense orderings, and usage notes, further aligning with pedagogical lexicography.
Research priorities include expanding LDDG coverage to additional languages, refining rubrics for human–machine evaluation, developing closed-loop training to optimize under multiple criteria, and incorporating explicit sense-differentiation modules to target the specificity bottleneck (Ide et al., 5 Jan 2026).
References
- "Towards Automated Lexicography: Generating and Evaluating Definitions for Learner's Dictionaries" (Ide et al., 5 Jan 2026)