Self-Verbalizations of Neologisms

Updated 28 December 2025
  • Self-verbalizations of neologisms are processes where language users generate natural language explanations for newly coined or repurposed words.
  • Recent computational frameworks show that training only new token embeddings in LLMs recovers over 90% of intended behavioral modifications.
  • This mechanism enhances model interpretability and controllability while bridging insights between historical linguistics and modern NLP.

Self-verbalizations of neologisms refer to the process by which a language user—whether human or artificial—articulates the meaning or function of a recently created or repurposed lexical item in its own words. In human natural language evolution, neologisms emerge to fill lexical gaps, support novel conceptual needs, or facilitate cognitive or communicative efficiency. In the context of LLMs, self-verbalization captures the model’s ability to generate natural-language explanations or synonyms elucidating what a newly trained (“neologism”) token now means to the model, frequently without ever being shown explicit descriptions during training. This property provides a unique reflection of the internalization and mapping of new concepts within linguistic or neural representational spaces, and has become both an analytical tool and controllability interface in modern computational linguistics and natural language processing.

1. Formalization and Computational Frameworks

The phenomenon of self-verbalization of neologisms is formalized in both historical linguistics and artificial LLMs, albeit with distinct mechanisms. In historical English, denominalization (zero-conversion) is rigorously defined as the process by which an existing noun $w$ (first attested as such at time $t_n$) becomes attested in verbal use at time $t_v > t_n$. Quantitatively, key variables include $\mathrm{change}(w) \in \{0,1\}$ (indicating whether the word ever denominalizes) and $d(w) = t_v - t_n$ (the lag between noun and verb use) (Shekarchi et al., 2020). Modern LLM-based approaches operationalize neologism learning via explicit vocabulary expansion: given a pretrained model with vocabulary $V$ and embedding matrix $E \in \mathbb{R}^{d \times |V|}$, new tokens $c_i$ are added, yielding an augmented vocabulary $V'$ and extended matrix $E'$, with only the newly introduced embeddings $e_{c_i}$ optimized while all original parameters remain frozen (Hewitt et al., 9 Oct 2025, Park et al., 21 Dec 2025).
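A minimal PyTorch sketch of this vocabulary-expansion step is given below. It assumes a decoder whose input embedding is exposed as `model.embed_tokens` and an untied output head; the attribute name, the gradient-mask trick, and the warm-start option are illustrative conventions, not details taken from the cited papers.

```python
import torch

def add_neologism_embedding(model, init_from: int | None = None) -> int:
    """Append one row to the input embedding matrix and arrange for
    training to touch only that row. Returns the new token's id."""
    old_embed = model.embed_tokens                    # nn.Embedding(|V|, d)
    vocab_size, dim = old_embed.weight.shape
    new_embed = torch.nn.Embedding(vocab_size + 1, dim)
    with torch.no_grad():
        new_embed.weight[:vocab_size] = old_embed.weight
        if init_from is not None:                     # warm-start e_c from an existing token
            new_embed.weight[vocab_size] = old_embed.weight[init_from]
    model.embed_tokens = new_embed

    # Freeze every original parameter; only the embedding matrix requires
    # grad, and a hook zeroes the gradient on all pre-existing rows so the
    # update is confined to the single new row e_c.
    for p in model.parameters():
        p.requires_grad_(False)
    new_embed.weight.requires_grad_(True)
    mask = torch.zeros(vocab_size + 1, 1, device=new_embed.weight.device)
    mask[vocab_size] = 1.0
    new_embed.weight.register_hook(lambda g: g * mask)
    return vocab_size                                 # id of the new token c

# Training then optimizes just this matrix, e.g.:
# opt = torch.optim.Adam([model.embed_tokens.weight], lr=1e-3)
```

A model with a tied LM head would additionally need the same row appended on the output side; the sketch omits that case.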

Self-verbalization in LLMs is then elicited post-training by querying the model: for a learned token $c$, prompts such as “What does $c$ mean to you?” or requests for synonyms force the model to map its internalized behavioral axis into a natural-language or subword-level output (Hewitt et al., 9 Oct 2025, Park et al., 21 Dec 2025).
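Concretely, the probing step reduces to formatting the learned token’s surface form into introspective templates. The wording below paraphrases the kinds of probes the papers describe and is illustrative rather than verbatim:

```python
# Hypothetical probe templates; "{c}" stands for the surface form of the
# learned neologism token. Wording is illustrative, not from the papers.
PROBE_TEMPLATES = [
    "What does {c} mean to you?",
    "List five synonyms for {c}.",
    "Describe what {c} responses are like.",
]

def build_probes(neologism: str) -> list[str]:
    """Instantiate each introspective probe for one learned token."""
    return [t.format(c=neologism) for t in PROBE_TEMPLATES]

# e.g. build_probes("quorple") -> ["What does quorple mean to you?", ...]
```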

2. Methodologies for Eliciting and Evaluating Self-Verbalizations

Experimental protocols to study self-verbalization of neologisms proceed as follows:

  1. Training Method: Neologism tokens are trained using labelled datasets $\mathcal{D} = \{(x_j, y_j^{+}, y_j^{-})\}$, where $x_j$ is a prompt containing the neologism, $y_j^{+}$ a response exhibiting the target concept, and $y_j^{-}$ a response corresponding to default or baseline behavior. Typical loss functions include negative log-likelihood and Anchored Preference Optimization (APO-up), which adjust the embedding vector so that the model associates $c$ with the intended behavior while minimizing drift from the base model (Hewitt et al., 9 Oct 2025, Park et al., 21 Dec 2025).
  2. Self-Verbalization Probing: After training, prompts that either request a list of synonyms or instruct the model to “describe what $c$ responses are like” are used to extract the model’s internal representation of the concept encoded in $c$. Both free-form answers and structured surveys (e.g., 12-question introspective probes) are utilized. In some cases, the output is subsequently distilled by a higher-capacity model into a single human-readable instruction (Hewitt et al., 9 Oct 2025).
  3. Plug-in Evaluation: To assess the faithfulness and functional relevance of self-verbalizations, candidate natural-language instructions or synonyms (drawn from the model’s own explanations) are inserted in place of the neologism in test prompts. The behavioral outputs are then compared to those obtained with the original token, using metrics such as gap closure on controlled tasks (e.g., mean word/sentence count, fraction of target words, or LLM-judged scores); a minimal sketch of this gap-closure computation follows the table below (Hewitt et al., 9 Oct 2025).
| Evaluation Stage | Procedure | Metric |
| --- | --- | --- |
| Self-verbalization | Free-text/synonym probing | Qualitative content |
| Plug-in validation | Replace neologism with its verbalization in test prompts | % of behavioral gap closed |
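The “% of behavioral gap closed” metric admits a simple formulation: score the behavior of interest under the base model, under the neologism, and under the plugged-in verbalization, then normalize. The normalization below is an assumed but natural reading of the metric, not a formula quoted from the papers.

```python
def gap_closure(score_base: float, score_neologism: float,
                score_verbalization: float) -> float:
    """Fraction of the base->neologism behavioral gap recovered when the
    neologism is replaced by its own verbalization in test prompts.

    1.0 means the plugged-in verbalization reproduces the neologism's
    effect exactly; 0.0 means it behaves like the unmodified base model.
    """
    gap = score_neologism - score_base
    if gap == 0:
        return 0.0  # no measurable effect to recover
    return (score_verbalization - score_base) / gap

# Example with a brevity concept scored by mean sentence count: base
# answers average 5.0 sentences, neologism-steered answers 1.2, and
# answers steered by the verbalized instruction 1.5:
# gap_closure(5.0, 1.2, 1.5) -> 0.92, i.e. ~92% of the gap closed.
```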

3. Empirical Properties and Mathematical Formulations

Several salient findings have emerged from the empirical study of self-verbalizations:

  • High-fidelity steering with minimal parameter updates: Training only the embedding vector(s) for new tokens recovers over 90% of the intended behavioral modification across diverse concepts while preserving the original model parameters (Hewitt et al., 9 Oct 2025, Park et al., 21 Dec 2025).
  • Self-description without explicit supervision: Models are able to produce accurate textual descriptions of a neologism’s effect—such as “responses are characterized by a lack of complete, coherent, or meaningful answers”—despite never observing explicit explanations during training (Hewitt et al., 9 Oct 2025).
  • Machine-only synonyms: Some self-verbalizations are unintuitive or opaque to humans (e.g., using “lack” as a synonym for a single-sentence constraint), yet reliably trigger the same behavior in the model. These tokens, termed "machine-only synonyms," provide evidence for latent representational axes not aligned with human lexical categories (Hewitt et al., 9 Oct 2025).
  • Emergent lexical innovation: LLMs occasionally invent previously unseen lexical items (e.g., “Mutexpoitary,” “Poornessily”) in response to probing. This contrasts with fine-tuned models (e.g., LoRA-adapted baselines), which lack the capacity for such neologistic self-description because they do not internalize the token as an active concept (Park et al., 21 Dec 2025).

Formally, the core neologism-learning objective is:

$$\min_{e_c}\; \mathbb{E}_{(x,\, y^{(c)},\, y^{(r)}) \sim \mathcal{D}} \left[ -\log \sigma(\ldots) - \log \sigma(\ldots) \right]$$

where the terms encourage the model to prefer chosen responses under the neologism context and to anchor predicted likelihoods to the pre-trained model (Park et al., 21 Dec 2025).
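Since the arguments of the two sigmoid terms are elided above, any implementation involves some reconstruction. The sketch below follows the spirit of anchored preference optimization: one term raises the log-likelihood ratio of the concept response $y^{(c)}$ against the frozen base model, the other anchors the reference response $y^{(r)}$ to it; the specific ratios and the `beta` temperature are assumptions, not the papers’ verbatim loss.

```python
import torch
import torch.nn.functional as F

def apo_style_loss(logp_c, logp_c_ref, logp_r, logp_r_ref, beta=0.1):
    """Assumed APO-up-style objective over the new embedding e_c.

    logp_c, logp_c_ref: sequence log-probs of the concept response y^(c)
        under the neologism-augmented model and the frozen base model.
    logp_r, logp_r_ref: the same quantities for the reference response
        y^(r). The sigmoid arguments are elided in the source, so the
        log-likelihood ratios here are a reconstruction, not a quotation.
    """
    up = beta * (logp_c - logp_c_ref)       # raise y^(c) above its base likelihood
    anchor = -beta * (logp_r - logp_r_ref)  # keep y^(r) pinned to its base likelihood
    return -F.logsigmoid(up) - F.logsigmoid(anchor)

# e.g., with summed token log-probs as 0-dim tensors:
# loss = apo_style_loss(torch.tensor(-12.3), torch.tensor(-15.0),
#                       torch.tensor(-9.8),  torch.tensor(-9.9))
```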

4. Cognitive and Linguistic Interpretation

From a cognitive perspective, the ability to self-verbalize neologisms reflects the model’s internal conceptualization and mapping of behavioral axes to language. In historical linguistics, denominalization is observed to be favored (i) for shorter words, likely due to processing efficiency, and (ii) for less recently frequent nouns, possibly due to reduced lexical competition (Shekarchi et al., 2020). The computational analog in LLMs is the efficient recoding of new behavioral instructions into a minimal representational update.
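These two predictors lend themselves to a simple illustrative test: regress $\mathrm{change}(w)$ on word length and recent frequency and inspect the coefficient signs. The sketch below uses scikit-learn with fabricated toy rows purely to make the example runnable; it is not the estimator specification of Shekarchi et al. (2020).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy records (word, recent frequency per million tokens, change(w));
# the numbers are fabricated for illustration only.
records = [
    ("bottle", 95.0, 1), ("text", 210.0, 1), ("ship", 150.0, 1),
    ("table", 300.0, 0), ("photograph", 40.0, 0), ("microscope", 8.0, 0),
]

X = np.array([[len(w), np.log(f)] for w, f, _ in records])
y = np.array([c for _, _, c in records])

clf = LogisticRegression().fit(X, y)

# Under the hypotheses above, one would expect a negative weight on word
# length (shorter nouns convert more readily) and a negative weight on
# log-frequency (lexical competition suppresses conversion).
print(dict(zip(["length", "log_frequency"], clf.coef_[0])))
```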

A plausible implication is that such self-verbalizations provide a direct probe of the implicit semantic space constructed by LLMs, revealing both human-aligned and model-unique axes of generalization. The occasional emergence of "machine-only" synonyms also indicates that the learned space may be systematically non-isomorphic with human lexical categories, especially under parameter-efficient updates targeting concept control (Hewitt et al., 9 Oct 2025, Park et al., 21 Dec 2025).

5. Applications and Impact in NLP

Self-verbalizations of neologisms serve multiple roles in computational linguistics and NLP:

  • Interpretability: They expose the meanings of newly introduced control tokens, aiding in debugging, concept refinement, and transparency in model steering (Hewitt et al., 9 Oct 2025).
  • Behavioral steering: Neologisms allow modular, on-demand alteration of model outputs (e.g., controlling flattery, verbosity, or factuality), with the self-verbalization acting as a bridge to natural-language control or further programmatic manipulation.
  • Human-machine communication: Analysis of self-verbalizations opens prospects for shared lexica between humans and LLMs, enabling grounded negotiation of concepts and semantics (Hewitt et al., 9 Oct 2025).
  • Linguistic creativity modeling: The generative capacity for internal synonyms and conceptual paraphrases directly models phenomena observed in natural language evolution—lexical innovation, semantic drift, and zero-conversion as mechanisms of lexicon expansion (Shekarchi et al., 2020).

6. Limitations, Open Questions, and Future Directions

Several challenges and research frontiers remain:

  • Faithfulness and hallucination: Not all verbalizations accurately reflect the underlying concept; some can be artifacts of question formulation or context effects, necessitating validation via plug-in evaluation or further interpretability techniques (Hewitt et al., 9 Oct 2025).
  • Model dependence: Machine-only synonyms may be non-transferable between model families, and their criteria for emergence remain incompletely characterized (Hewitt et al., 9 Oct 2025).
  • Scalability and compositionality: Training large numbers of neologisms or encoding compound concepts raises theoretical and practical issues around embedding interference, representational efficiency, and long-term capacity.
  • Mechanistic interpretability: The pathways by which LLMs generalize from a single added embedding to robust, out-of-context language generation remain to be fully elucidated (Hewitt et al., 9 Oct 2025).
  • Systematic assessment: Quantifying the frequency and scope of invented verbalizations, integrating verbalization feedback into the training loop, and designing targeted prompting or loss functions to control the type and content of self-verbalizations constitute active areas of inquiry (Park et al., 21 Dec 2025).

7. Relationship to Historical and Human Language Change

The study of self-verbalizations of neologisms in machine models both mirrors and extends findings in historical linguistics. In English, denominalization is subject to principled constraints—short word length and low recent frequency favor lexical innovation—with cognitive explanations rooted in articulatory and semantic economy (Shekarchi et al., 2020). LLMs, when endowed with minimal new parameters, exhibit comparable tendencies: neologisms develop context-grounded meanings and propagate their effects through compositional generalization and paraphrasing, even in the absence of explicit human guidance.

The systematic investigation of self-verbalizations thus bridges computational, cognitive, and empirical perspectives on how new words emerge, solidify, and become communicable—whether in human discourse or artificial intelligence.
