
OpenGloss: AI Lexicon and Semantic Graph

Updated 30 November 2025
  • OpenGloss is a synthetically generated encyclopedic dictionary and semantic knowledge graph that unifies definitions, etymologies, usage examples, and semantic relations.
  • It employs a multi-agent LLM pipeline with strict schema validation to produce 150,101 lexemes and 536,829 senses with extensive collocations and examples.
  • The resource supports diverse applications in NLP, education, and semantic analysis by offering rich contextual data and pedagogical clarity.

OpenGloss is a synthetically generated encyclopedic dictionary and semantic knowledge graph for the English language that unifies lexicographic sense definitions, encyclopedic prose, etymological history, usage examples, collocations, and rich semantic relationships. Leveraging a rigorously schema-validated, multi-agent LLM pipeline, OpenGloss achieves the scale and semantic density of leading lexical resources while providing deeper pedagogical and encyclopedic content. The resource comprises 150,101 lexemes and 536,829 senses—on par with, and in many respects surpassing, traditional manually curated lexical databases in both coverage and content richness. OpenGloss is publicly available under an open CC-BY 4.0 license (Bommarito, 23 Nov 2025).

1. Resource Design and Composition

OpenGloss was motivated by longstanding trade-offs in traditional lexical resources among vocabulary breadth, depth of annotation, currency, and cost of curation. Its lexicon comprises 150,101 entries (94,106 single words; 55,995 multi-word expressions), drawn from a filtered "wamerican" American English wordlist, further expanded via LLM-driven snowballing seeded on K–12 educational concepts. The sense inventory encompasses 536,829 distinct senses, yielding an average of 3.58 senses per lexeme ($\bar{s} = 3.58$, median = 3, maximum = 24), constrained to 1–4 senses per part of speech to favor pedagogical utility over extreme granularity.

The semantic network contains 9.14 million typed, directed edges, including 1.6M synonymy, 1.1M antonymy, 1.1M hypernymy, and 1.4M hyponymy sense-level relations, in addition to 3.1M collocations and 875K inflections. Each sense typically includes 1–3 authentic usage examples, with a total of approximately 1 million example sentences. Further, 99.7% of senses are supplemented with 200–400 word encyclopedic context, comprising 60 million words overall, while 97.5% of lexemes are annotated with full etymological trails capturing cognates, semantic evolution, and key citations.

2. Generation Pipeline and Automated Quality Control

The OpenGloss procedural pipeline executes four major stages, orchestrated via pydantic-ai using strict schema validation with Pydantic V2:

  1. Lexeme Selection: Initial vocabulary is seeded with ≈104,000 words from "wamerican" (filtered by length, alphabetic content, and pedagogical relevance) and further expanded by 76,901 additional terms proposed from an LLM-generated neighbor graph.
  2. Sense Generation: Each lexeme invokes two collaborating agents. An overview agent determines applicable POS tags, classifies stopwords, and proposes the sense count. A POS-details agent generates 1–4 senses per POS, each with a concise 50–150 character definition, paradigmatic relations (synonyms, antonyms, hypernyms, hyponyms), usage examples, morphological forms, and 3–6 collocations. All agent outputs are instantly schema-validated; invalid responses (2–4%) trigger automated re-prompting.
  3. Graph Construction: All semantic and morphological relations are deterministically extracted from sense and POS data with no further LLM involvement. Graph validation enforces acyclicity of taxonomies, reference integrity, and pairwise symmetry for antonymy/synonymy.
  4. Enrichment: Dedicated etymology and encyclopedia agents generate 200–400 word narratives for 97.5% of lexemes and encyclopedic entries for 99.7% of senses, both subject to schema validation and merging.

A final LLM-based quality-assurance pass (Anthropic Claude Sonnet 4.5) was applied to a sample of 1,000 entries, scoring lexicographic structure, definitional clarity, encyclopedic accuracy, etymological plausibility, and semantic-relation validity. Reported rates are 14.1% "High Confidence", 17.1% "Acceptable with Minor Issues", and 68.8% flagged (the majority due to deliberate WordNet-inspired design choices, such as the inclusion of inflected forms and proper nouns). Clean rates on core content are: definitions 62%, encyclopedic entries 79%, etymologies 74%, usage examples 65%.

The entire pipeline completes in under 96 wall-clock hours using only OpenAI gpt-5-nano API endpoints, with compute costs under US$1,000 (Bommarito, 23 Nov 2025).

3. Comparative Analysis with Other Lexical Resources

OpenGloss directly addresses the stagnation and schema limitations of manually curated resources such as Princeton WordNet, which, despite gold-standard sense distinctions, has not seen a major downloadable update since 2006 and omits encyclopedic and etymological context. The lemma count in OpenGloss (150,101) is comparable to WordNet (147,306) and Open English WordNet 2024 (152,000), but OpenGloss provides roughly 4.56× more sense definitions than WordNet (536,829 vs. 117,659).

OpenGloss and WordNet complement each other: only 38% of their lemma lists overlap (56,637 shared words). WordNet targets fine-grained expert curation (e.g., 57 synsets for "run"), whereas OpenGloss supplies coarser but more pedagogically oriented glosses, with broader encyclopedic content and use-case-driven examples (e.g., 6 senses for "run"). Crowdsourced and integration-based resources such as BabelNet and ConceptNet provide larger-scale or broader commonsense coverage but suffer from heterogeneity, weaker schema alignment, and less reliable quality control.

4. Data Model, Schema, and Semantic Network

The OpenGloss schema integrates:

  • Lexeme and Sense Structure: Each lexeme is mapped to multiple senses, each characterized by part of speech, definition, usage examples, inflectional/derivational morphology, and contextual collocations.
  • Semantic Edges: Directed, typed relationships at sense-level (synonymy, antonymy, hypernymy, hyponymy: total 5.20M edges) and POS-level (collocations, inflections: 3.94M edges). Graph validation guarantees acyclicity and ensures that antonymy and synonymy are symmetric.
  • Enrichment Fields: Each sense is linked to a 200–400 word encyclopedic entry (60M words total), and each lexeme is annotated with an etymological narrative for 97.5% of cases.
  • Quality and Validation: All agent outputs are constrained by strict JSON schema; errorful or nonconformant data triggers automatic retries with prompt modulation until compliance is achieved.
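The two graph invariants described above (pairwise symmetry of synonymy/antonymy, acyclicity of the hypernymy taxonomy) reduce to simple set and depth-first-search checks. The edge representation below is an assumption for illustration, not OpenGloss's internal format:

```python
# Minimal sketch of the graph-validation invariants; edge and taxonomy
# representations here are illustrative assumptions.

def check_symmetry(edges: set[tuple[str, str]]) -> bool:
    """Synonymy/antonymy edges must appear in both directions."""
    return all((b, a) in edges for (a, b) in edges)


def check_acyclic(hypernyms: dict[str, list[str]]) -> bool:
    """The hypernymy taxonomy must contain no cycles (DFS with coloring)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color: dict[str, int] = {}

    def dfs(node: str) -> bool:
        color[node] = GRAY
        for parent in hypernyms.get(node, []):
            c = color.get(parent, WHITE)
            if c == GRAY:               # back edge: cycle found
                return False
            if c == WHITE and not dfs(parent):
                return False
        color[node] = BLACK
        return True

    return all(dfs(n) for n in list(hypernyms) if color.get(n, WHITE) == WHITE)
```

Running these checks after deterministic extraction (stage 3) keeps the LLM out of the loop: any violation points at a specific sense record to regenerate rather than at the graph as a whole.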

The vocabulary and sense statistics can be formally summarized as:

$$|L| = 150{,}101, \qquad \sum_{\ell \in L} |S(\ell)| = 536{,}829, \qquad \bar{s} = \frac{1}{|L|} \sum_{\ell \in L} |S(\ell)| \approx 3.58$$
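A quick arithmetic check confirms these statistics are mutually consistent:

```python
# Values taken from the text; the average-senses figure follows directly.
num_lexemes = 150_101
total_senses = 536_829
avg_senses = total_senses / num_lexemes
print(f"{avg_senses:.2f}")  # → 3.58
```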

5. Applications and Impact

The unique integration of definitions, semantic relations, authentic examples, encyclopedic context, and etymological history in a single schema positions OpenGloss for a wide range of research, pedagogical, and NLP applications:

  • Vocabulary learning platforms and adaptive tutoring systems benefit from the pedagogical orientation, integrated context, and expansive collocation data.
  • Reading-comprehension and assistive tools gain from on-demand concept summaries that combine dictionary and encyclopedic perspectives.
  • Curriculum and glossary generation is facilitated by the co-availability of definitions, encyclopedic context, and collocational information.
  • Natural language processing research: The resource enables new benchmarks for word sense disambiguation (with ~1M sense-tagged examples), semantic similarity measurement, lexical substitution, and knowledge-grounded language modeling; it also provides a synthetic-vs-manual resource baseline for ontology learning tasks.
  • Knowledge-enhanced retrieval and modeling: Taxonomic and encyclopedic embeddings offer a testbed for semantic graph pretraining and large-scale language modelling.
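As a sketch of the WSD use case above, the sense-tagged examples can be flattened into supervised training triples. The per-sense dict layout here is a hypothetical serialization chosen for illustration, not the published data format:

```python
# Hypothetical layout: one dict per sense with "lemma", "sense_id",
# and "examples" keys; not the actual OpenGloss release format.

def wsd_triples(senses: list[dict]) -> list[tuple[str, str, str]]:
    """Flatten sense records into (context, lemma, gold sense id) triples."""
    triples = []
    for sense in senses:
        for example in sense.get("examples", []):
            triples.append((example, sense["lemma"], sense["sense_id"]))
    return triples


sample = [
    {"lemma": "bank", "sense_id": "bank.n.1",
     "examples": ["She deposited the cash at the bank."]},
    {"lemma": "bank", "sense_id": "bank.n.2",
     "examples": ["They picnicked on the river bank."]},
]
print(len(wsd_triples(sample)))  # → 2
```

With roughly one million sense-tagged examples available, the same flattening yields a WSD benchmark at a scale traditional hand-annotated corpora cannot match.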

OpenGloss's resource composition and speed/cost advantages—generation at WordNet-scale in under a week and for less than $1,000—demonstrate that individual research teams can feasibly generate and maintain comprehensive lexical resources without institutional infrastructure (Bommarito, 23 Nov 2025).

6. Limitations and Future Directions

OpenGloss is subject to several limitations that reflect the maturity of current generation models and resource conception:

  • Semantic relations may lack the expert curation and precision of traditional lexicographic databases, especially for nuanced or borderline cases.
  • The imposed sense granularity (1–4 per POS) prioritizes pedagogical tractability over exhaustive expert-level distinctions.
  • Etymology and encyclopedic texts are plausible educational narratives assembled by generative models; they do not provide peer-reviewed, scholarly citations.
  • Biases may arise from the LLM's training data, including regional, cultural, or dialectal underrepresentation.

Future enhancements include comprehensive human expert evaluation (200+ lexemes with inter-annotator agreement), benchmark evaluation on WSD and semantic similarity tasks, multilingual and domain-specific extensions (medical, legal, STEM), robust cross-resource mapping (WordNet, BabelNet, Wikidata), and open-sourcing the generation pipeline to support collaborative community extensions and continual updates.

OpenGloss illustrates that systematically engineered, LLM-driven procedural generation can offer practical, high-coverage, richly annotated semantic resources at unprecedented cost and time scales—expanding the design space of lexical databases and supporting continually evolving language understanding systems.
