
Neologism Learning

Updated 10 October 2025
  • Neologism Learning is the process by which novel lexical items are created, adopted, and integrated into linguistic and computational frameworks.
  • It involves mechanisms such as invention and imitation, with models like agent-based simulations and game-theoretic approaches explaining viral diffusion and semantic shifts.
  • Applications span from tracking language evolution in historical corpora to fine-tuning large language models for precise lexical control and robustness.

Neologism learning is the process by which individuals, communities, artificial cognitive systems, or machine learning models acquire, generate, diffuse, and integrate new lexical items—word forms, meanings, or conceptual representations—into their communicative repertoire. This domain encompasses linguistic, sociological, computational, and interdisciplinary mechanisms and is critical for tracing language evolution, controlling model behaviors, and expanding human–machine communication. Neologisms arise in response to conceptual gaps or communicative demands and fulfill semantic, pragmatic, or functional needs in natural and artificial languages.

1. Mechanisms of Neologism Creation and Diffusion

Neologism creation operates via two intertwined mechanisms: invention and imitation (Paradowski et al., 2011). Invention is the spontaneous coining of new lexemes by individual agents, driven by creativity or by the need to name novel concepts, as seen in digital networks, professional jargon, and periods of social upheaval (Muravyev et al., 2018, Säily et al., 2021).

Imitation refers to the adoption and propagation of neologisms following their initial creation. The threshold model of collective behavior posits that individual agents will adopt a neologism after a critical proportion of their peers have done so, formalized by the exposure threshold:

B_u = \frac{\sum A(e_{u \rightarrow x})}{|H(u)|}

where A(e_{u \rightarrow x}) is the number of agent u's neighbors who had adopted tag x by the time of u's adoption, and |H(u)| is u's total neighbor count (Paradowski et al., 2011). Most users on social platforms display low thresholds, facilitating rapid viral spread of new forms.
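
A minimal sketch of the exposure-threshold computation under these definitions; the data structures and names (`neighbors`, `adopt_time`) are illustrative, not from the cited paper:

```python
def exposure_threshold(neighbors, adopt_time, t_u):
    """B_u: fraction of u's neighbors who had adopted tag x before u's own
    adoption time t_u (|H(u)| is the size of the neighbor set)."""
    if not neighbors:
        return 0.0
    prior = sum(1 for v in neighbors if adopt_time.get(v, float("inf")) < t_u)
    return prior / len(neighbors)

# Toy example: u adopts at t=5; two of three neighbors adopted earlier.
print(exposure_threshold({"a", "b", "c"}, {"a": 2, "b": 4, "c": 9}, t_u=5))  # ~0.67
```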

Systems employing agent-based and game-theoretic modeling capture these dynamics, accounting for heterogeneous neighbor counts, interaction intensities, prestige, and network structure, which collectively drive the institutionalization of neologisms.

2. Linguistic Processes and Formal Models

Neologism formation often exploits word-formation processes such as affixation, compounding, conversion, and borrowing. Studies on Russian (Muravyev et al., 2018) and English (Säily et al., 2021) identify recurrent patterns (illustrated in the sketch after this list):

  • Suffixation and prefixation: New words created by attaching morphemes, formalized as

\text{New Word} = \text{Prefix} + \text{Stem} + \text{Suffix}

  • Compounding: Combining stems (e.g., "ST_1-ST_2") produces new concepts.
  • Mixed forms: Hybridization of native and borrowed elements (e.g., "лайкать").
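
A toy illustration of these processes as string operations; the example words are invented, apart from the hybrid form cited above:

```python
def affix(stem: str, prefix: str = "", suffix: str = "") -> str:
    """New Word = Prefix + Stem + Suffix."""
    return prefix + stem + suffix

def compound(st1: str, st2: str) -> str:
    """Combine two stems (ST_1-ST_2) into a new form."""
    return st1 + st2

print(affix("friend", prefix="un", suffix="ly"))  # "unfriendly"
print(compound("doom", "scroll"))                 # "doomscroll"
print(affix("лайк", suffix="ать"))                # hybrid form "лайкать"
```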

Semantic neologisms—instances where an existing word acquires a novel meaning—can be detected through topic modeling, keyword extraction, and deep embedding-based word sense disambiguation: the text's topic is identified first, and a discrepancy between a word's semantic field (derived from its embeddings) and that topic signals an emergent meaning (Torres-Rivera et al., 2020).
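
A sketch of this discrepancy check, assuming precomputed vectors for the word's contextual usage and the document's topic; the cosine threshold is a placeholder, not the cited system's value:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def flags_emergent_sense(word_usage_vec, doc_topic_vec, threshold=0.3):
    """Low similarity between a word's contextual embedding and the document
    topic suggests the word is being used with a novel, emergent meaning."""
    return cosine(word_usage_vec, doc_topic_vec) < threshold

rng = np.random.default_rng(0)
print(flags_emergent_sense(rng.normal(size=50), rng.normal(size=50)))
```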

Distributional semantic analysis in diachronic corpora reveals two predictive factors:

  • Semantic sparsity: New words emerge in underpopulated regions of semantic space,

d(w, \tau) = |\{u : \cos(\mathbf{v}_w, \mathbf{v}_u) \geq \tau\}|

  • Frequency growth rate of neighbors: Rapid growth among semantic neighbors,

r(w, \tau) = \frac{1}{d(w, \tau)} \sum_{u: \cos(\mathbf{v}_w, \mathbf{v}_u) \geq \tau} r_s(\{1, \ldots, T\}, f_{1:T}(u))

where r_s is the Spearman rank correlation between time and frequency (Ryskina et al., 2020).
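
Both predictors can be sketched as follows, assuming a precomputed embedding matrix and per-word frequency counts over T time slices; all names are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def semantic_neighbors(v_w, vectors, tau=0.5):
    """Indices u with cos(v_w, v_u) >= tau; their count is d(w, tau)."""
    sims = vectors @ v_w / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v_w))
    return np.flatnonzero(sims >= tau)

def neighbor_growth_rate(neighbor_ids, freqs):
    """r(w, tau): mean Spearman correlation between time 1..T and each
    neighbor's frequency series f_{1:T}(u)."""
    times = np.arange(1, freqs.shape[1] + 1)
    rs = [spearmanr(times, freqs[u])[0] for u in neighbor_ids]
    return float(np.mean(rs)) if len(rs) else 0.0

rng = np.random.default_rng(1)
vectors = rng.normal(size=(100, 50))          # diachronic embedding matrix
freqs = rng.poisson(5.0, size=(100, 12))      # counts over 12 time slices
nbrs = semantic_neighbors(vectors[0], vectors, tau=0.2)
print(len(nbrs), neighbor_growth_rate(nbrs, freqs))
```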

3. Neologism Learning in Computational and AI Systems

Modern approaches extend neologism learning into AI by introducing new tokens associated with novel concepts into LLMs. The method, termed "neologism learning," adds new word embeddings and optimizes them with preference-based losses while freezing all other parameters (Hewitt et al., 9 Oct 2025, Hewitt et al., 11 Feb 2025). For k concepts, tokens c_1, \ldots, c_k expand the vocabulary, and only the embeddings \mathbf{E}_{c_1}, \ldots, \mathbf{E}_{c_k} are trained:

\min_{\mathbf{E}_{c_1}, \ldots, \mathbf{E}_{c_k}} \mathbb{E}_D \left[ \mathcal{L}(x, y^{(c)}, y^{(r)}) \right]

with \mathcal{L} as the APO-up loss:

\mathcal{L}(x, y_c, y_r) = -\log \sigma\bigg( \beta \log \frac{p_\theta(y_c|x)}{p_\theta(y_r|x)} + \beta \log \frac{p_{\theta_0}(y_c|x)}{p_{\theta_0}(y_r|x)} \bigg) - \log \sigma\bigg( \beta \log \frac{p_\theta(y_c|x)}{p_{\theta_0}(y_c|x)} \bigg)

This method enables precise control of model outputs, such as response length, diversity, incorrectness, flattery, or domain-specific behaviors (Hewitt et al., 11 Feb 2025, Hewitt et al., 9 Oct 2025).
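
A minimal PyTorch sketch of this setup, assuming a Hugging Face-style model and tokenizer; the gradient-masking trick is one common way to restrict updates to the new embedding rows, and the preference loss itself is omitted:

```python
import torch

def add_neologisms(model, tokenizer, new_words):
    """Extend the vocabulary with new tokens and arrange for only their
    embedding rows E_{c_1..k} to receive gradient updates."""
    tokenizer.add_tokens(new_words)
    model.resize_token_embeddings(len(tokenizer))
    for p in model.parameters():          # freeze every original parameter
        p.requires_grad = False
    emb = model.get_input_embeddings()
    emb.weight.requires_grad = True
    new_ids = tokenizer.convert_tokens_to_ids(new_words)
    mask = torch.zeros_like(emb.weight)
    mask[new_ids] = 1.0
    emb.weight.register_hook(lambda grad: grad * mask)  # zero grads on old rows
    return new_ids

# Training then optimizes only the embedding matrix under the preference loss:
# opt = torch.optim.AdamW([model.get_input_embeddings().weight], lr=1e-3)
```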

Self-verbalization is the phenomenon whereby models trained with neologisms can describe, in natural language, the meaning or function of a new token—thus exposing latent machine concepts. Plug-in evaluation validates these descriptions: the model's own verbalization is substituted for the neologism in the prompt, and the output is checked for the same conceptual control (Hewitt et al., 9 Oct 2025).
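
A schematic of the plug-in protocol; `generate` stands in for any text-generation call, and the comparison step is left abstract:

```python
def plugin_eval(generate, prompt, token, verbalization):
    """Replace the neologism with the model's own natural-language description
    and compare behavior: matched outputs support the verbalization."""
    original = generate(prompt)
    swapped = generate(prompt.replace(token, verbalization))
    return original, swapped  # compare, e.g., for the same controlled property
```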

Discovery of "machine-only synonyms"—words that, while semantically opaque to humans, drive similar model behavior when substituted—highlights the divergence between human and machine conceptual spaces.

4. Temporal Drift, Robustness, and Evaluation Benchmarks

Emergence and adoption of neologisms contribute significantly to temporal drift—the divergence between training and inference distributions in LLMs. The NEO-BENCH resource evaluates LLM performance on tasks involving neologisms, showing degradation in metrics such as perplexity:

\mathrm{Perplexity} = \exp\left( -\frac{1}{N} \sum_{i=1}^N \log P(w_i) \right)

and downstream performance in machine translation, question answering, and definition generation (Zheng et al., 19 Feb 2024). Inserting a single neologism can halve translation quality, and LLMs' perplexity and task accuracy are strongly correlated with their knowledge cutoff date and the linguistic structure of neologisms (lexical, morphological, or semantic).
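
A direct transcription of the perplexity formula over per-token log-probabilities (toy numbers, not NEO-BENCH data):

```python
import math

def perplexity(token_logprobs):
    """exp(-(1/N) * sum_i log P(w_i))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-2.1, -0.4, -1.3]))  # familiar text
print(perplexity([-2.1, -6.8, -1.3]))  # same text with an unseen neologism
```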

Benchmark results underscore that dynamic vocabulary management, continual learning, and improved tokenization strategies are critical for maintaining LLM robustness in the face of ongoing lexical innovation.

WINODICT introduces in-context learning benchmarks in which LLMs must acquire a new word from its definition at inference time. Even large models exhibit significant drops in accuracy (18+ percentage points relative to the original Winograd schemas) when resolving synthetic neologisms, even with the definition supplied in the prompt (Eisenschlos et al., 2022).
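
A schematic of how such an instance can be constructed; the synthetic word, definition, and schema wording are illustrative, not taken from WINODICT itself:

```python
# Replace a key word in a Winograd-style schema with a synthetic neologism
# and prepend its definition, so the model must learn the word in context.
definition = 'Definition: something "plestered" is too large to fit inside a container.'
schema = ("The trophy doesn't fit in the suitcase because it is too plestered. "
          "What is too plestered, the trophy or the suitcase?")
prompt = definition + "\n\n" + schema
print(prompt)
```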

5. Applications in Under-resourced and Historical Languages

Neologism learning is particularly challenging in under-resourced languages, where lexical gaps are acute (Camacho, 2023). The proposed computational methodologies filter foreign candidate words using stringent phonological and orthographic rules to ensure compatibility with the target language’s native structure; only words passing IPA transcription checks and contextual syllable constraints are retained.
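
A minimal sketch of such a filter, with a placeholder phoneme inventory and a (C)V syllable constraint standing in for the paper's actual rules:

```python
import re

TARGET_PHONEMES = set("ptkmnswaiu")                   # hypothetical native inventory
CV_SYLLABLES = re.compile(r"^(?:[ptkmnsw]?[aiu])+$")  # only (C)V syllables allowed

def admissible(ipa: str) -> bool:
    """Keep a candidate only if its IPA transcription uses native phonemes
    and parses into legal syllables."""
    return set(ipa) <= TARGET_PHONEMES and bool(CV_SYLLABLES.match(ipa))

print([w for w in ["maku", "stra", "panu"] if admissible(w)])  # ['maku', 'panu']
```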

Diachronic corpora such as SiDiaC for Sinhala facilitate tracking neologism formation over long time spans, supporting historical lexicography, semantic shift analysis, and genre-specific studies (Jayatilleke et al., 22 Sep 2025). Careful annotation, modernisation, and syntactic segmentation create the basis for statistical detection of emergent word forms and semantic changes, as measured by relative frequency and context-driven outlier analysis.
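
A sketch of the relative-frequency signal used to surface emergent forms, over a toy two-period corpus; the corpus contents are invented:

```python
from collections import Counter

def relative_freq(tokens_by_period):
    """Per-period relative frequency of each word: the basic statistic behind
    detecting emergent forms as diachronic frequency outliers."""
    out = {}
    for period, tokens in tokens_by_period.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        out[period] = {w: c / total for w, c in counts.items()}
    return out

corpus = {"1900s": ["king", "temple", "king"], "2000s": ["selfie", "king", "selfie"]}
print(relative_freq(corpus))  # "selfie" appears only in the later period
```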

6. Interdisciplinary and Domain-specific Innovations

Neologisms also encapsulate new interdisciplinary fields, as in "econophysics" (Sharma et al., 2011), or foster conceptual precision in scientific modeling (e.g., the introduction of "sustainant" for resilience studies (Tamberg et al., 2020)). In complexity science, "artification" designates the transformation of non-art objects into art, bridging domains of aesthetics and computational modeling (Franceschet, 2014).

Computational tools such as NeoN for Polish exemplify automated neologism detection and classification pipelines, combining corpus-based filtering, context-aware lemmatization, and LLM-driven semantic and domain assignment, resulting in scalable systems for lexical innovation tracking (Tomaszewska et al., 21 May 2025).

NeLLCom-Lex provides an agent-based neural modeling framework for simulating semantic change and neologism integration, leveraging grounded supervised learning and communicative-reinforcement learning pipelines that emulate human-like adaptation, conceptual drift, and lexical diversity (Zhang et al., 26 Sep 2025).

7. Future Directions and Open Problems

Empirical studies highlight the need for improved detection and modeling of neologism emergence, semantic drift, and adoption dynamics in both human and artificial contexts. Ongoing research targets:

  • Integration of time-sensitive and context-adaptive language modeling to mitigate the impact of temporal drift.
  • Refining embedding algorithms to represent polysemy and semantic novelty.
  • Expanding benchmarks and corpora for neologism tracking, especially in under-resourced and historical language contexts.
  • Developing human–machine shared vocabularies for interpretability and controllability (Hewitt et al., 11 Feb 2025).
  • Applying preference-based and compositional learning to acquire and control multiple concepts simultaneously (Hewitt et al., 9 Oct 2025).

Research into neologism learning thus offers robust tools and frameworks for understanding language evolution, guiding AI development, enabling precise communication, and supporting the ongoing expansion of lexical and conceptual space in both human and computational languages.
