
Vocabulary In-Context Learning (VICL)

Updated 16 November 2025
  • Vocabulary In-Context Learning (VICL) is a framework where models acquire word semantics directly from contextual cues without explicit memorization or parameter updates.
  • Key methodologies such as WINODICT, Broccoli, and latent-space clustering evaluate VICL by testing one-shot semantic induction and by structuring demonstrations to stabilize in-context learning.
  • The approach highlights theoretical advances like universal approximation via precise positional encoding and exposes practical challenges in extending context-based word learning.

Vocabulary In-Context Learning (VICL) is a paradigm that investigates and exploits the capability of both biological and artificial language learners to acquire, distinguish, and manipulate the semantics of vocabulary items directly from context, rather than via explicit memorization or parameter adaptation. Contemporary formulations operationalize VICL as measuring or inducing the acquisition of word meanings from context during inference, especially in LLMs, or as structuring demonstrations and internal representations to stabilize and enhance in-context learning. VICL thus spans computational linguistics, memory modeling, and Transformer-based architectures.

1. Formal Definitions and Scope

VICL encompasses the set of evaluation and training procedures where knowledge about novel vocabulary is acquired, updated, or evaluated solely via transient context windows—without explicit parameter modification. Formally, a VICL instance can be described as a tuple

$(d, x, o_1, o_2, y)$

where:

  • $d$: dictionary definition or semantic description of a novel (possibly synthetic) token,
  • $x$: sentence or document prefix up to a target position,
  • $o_1, o_2$: candidate spans (e.g., noun phrases) for resolution or labeling,
  • $y$: sentence or document suffix.

The goal of VICL benchmarks and models is to evaluate the system’s ability to acquire and operationalize the semantics of this new token in situ, purely from the context and definition provided in $d$, during a single inference pass.
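As a concrete illustration, such an instance can be represented and rendered into a prompt as in the minimal sketch below; the dataclass fields mirror the tuple above, but the field names and prompt template are illustrative rather than the exact wording of any specific benchmark.

```python
from dataclasses import dataclass

@dataclass
class VICLInstance:
    """One VICL example: the tuple (d, x, o1, o2, y) defined above."""
    definition: str   # d: dictionary definition of the novel token
    prefix: str       # x: sentence/document prefix up to the target position
    option_1: str     # o1: first candidate span
    option_2: str     # o2: second candidate span
    suffix: str       # y: sentence/document suffix

def render_prompt(inst: VICLInstance, new_word: str) -> str:
    """Assemble a single-pass prompt that introduces the novel word via its
    in-context definition and asks for span resolution. The template is a
    hypothetical illustration, not an exact benchmark format."""
    return (
        f"The word '{new_word}' means {inst.definition}\n"
        f"{inst.prefix} [{inst.option_1} / {inst.option_2}] {inst.suffix}"
    )
```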

A related but distinct formalism appears in functional approximation settings, where VICL addresses the capacity of a (prompted) Transformer to approximate arbitrary mappings by choosing context tokens from a finite vocabulary, potentially enhanced with positional encodings. This aligns VICL with the universal approximation property (UAP) of neural architectures.

2. Key Methodologies

2.1. Synthetic Word Acquisition: WINODICT

WINODICT (Eisenschlos et al., 2022) introduces a rigorous VICL benchmark by rewriting Winograd-style pronoun resolution problems. The core aim is to measure LLMs’ capacity for one-shot or few-shot semantic induction:

  • Synthetic vocabulary items are created using probabilistic models of English n-grams, with morphologies induced by suffix-transformation rules.
  • Each Winograd schema’s “key concept” is replaced by a synthetic word, embedding its WordNet definition directly in the prompt (“The verb to <new-word> means <d>”).
  • The model must resolve coreference by relying on the in-prompt definition, not parameter knowledge.

The evaluation protocol computes accuracy as the proportion of correct binary choices, averaging over 5 splits (~500 synthetic-labeled examples). Human baselines are also established.
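A minimal sketch of this scoring protocol appears below, assuming a generic `score_candidate(prompt, candidate)` callable (e.g., a log-likelihood query to an LLM); the function names and example dictionary keys are hypothetical.

```python
import statistics

def evaluate_split(examples, score_candidate) -> float:
    """Accuracy on one split: the model 'chooses' whichever candidate the
    scoring function prefers (e.g., by log-likelihood of the completed
    sentence). `score_candidate` stands in for any LLM scoring call."""
    correct = 0
    for ex in examples:
        s1 = score_candidate(ex["prompt"], ex["option_1"])
        s2 = score_candidate(ex["prompt"], ex["option_2"])
        predicted = ex["option_1"] if s1 >= s2 else ex["option_2"]
        correct += int(predicted == ex["gold"])
    return correct / len(examples)

def evaluate_benchmark(splits, score_candidate):
    """Mean and spread of accuracy over the splits, mirroring the
    'average over 5 splits' protocol described above."""
    accs = [evaluate_split(split, score_candidate) for split in splits]
    return statistics.mean(accs), statistics.stdev(accs)
```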

2.2. Embedded Human VICL: Broccoli

Broccoli (Aydin et al., 2021) generalizes VICL to human language acquisition during natural reading. The system interleaves vocabulary exposure with ordinary reading by:

  • Assigning a recall probability (SuperMemo-style, $R_w = 2^{-t_w/H_w}$) per word lemma, and updating it after each exposure.
  • Estimating contextual guessability $G_w$ using an LLM.
  • Computing a joint comprehension probability $P_w = R_w + G_w - R_w G_w$.
  • Prioritizing which words to “switch” to their translation by the composite score $S(w) = P_w \gamma_w$ (see the sketch after this list) and integrating them at a specified density into the current text.
  • Relying on natural spaced repetition as words recur in typical information diets.
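The scoring quantities can be sketched as follows, interpreting $H_w$ as a per-word half-life, $t_w$ as time since last exposure, and $\gamma_w$ as a per-word priority weight; the formulas follow the list above, but the function signatures and exact parameterization used by Broccoli are assumptions.

```python
def recall_probability(t_w: float, H_w: float) -> float:
    """SuperMemo-style exponential forgetting: R_w = 2^(-t_w / H_w),
    with t_w the time since last exposure and H_w the word's half-life."""
    return 2.0 ** (-t_w / H_w)

def comprehension_probability(R_w: float, G_w: float) -> float:
    """P_w = R_w + G_w - R_w * G_w: the word is understood if it is
    either recalled or guessable from context (noisy-OR combination)."""
    return R_w + G_w - R_w * G_w

def switch_score(P_w: float, gamma_w: float) -> float:
    """Composite priority S(w) = P_w * gamma_w used to rank which words
    to switch to their translation; gamma_w is treated as given."""
    return P_w * gamma_w
```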

Short-term and long-term retention are measured via multiple-choice and fill-the-gap tests, with outcomes compared to traditional tables, and cognitive strategies are surveyed.

2.3. Latent Space Clustering for Stable ICL

Vocabulary-defined semantics (Gu et al., 29 Jan 2024) frames VICL as the construction of semantic reference frames in latent space corresponding to vocabulary items. The process:

  • Computes semantic basis vectors for each output token using the pseudoinverse of the LM head matrix, $W^+ = (W^T W)^{-1} W^T$, yielding $r_i = \ell_i^T W^+$.
  • Embeds each data example’s final representation and applies a calibration module $\lambda(r) = \text{LN}(\text{MLP}(\text{CA}(r)))$.
  • Clusters examples by cosine similarity to their vocabulary-defined semantic bases.
  • Selects in-context demonstrations for each query by latent-space cluster proximity, stabilizing performance versus input-space or random selection (see the sketch below).

Efficiency is increased by limiting head computations to only relevant vocabulary clusters and using a lightweight clustering module ($O(d^2)$ parameters).
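The selection logic can be sketched as follows, assuming an LM head matrix of shape (|V|, d) and precomputed final-layer representations; the calibration module $\lambda(\cdot)$ is omitted, and the shape conventions are an illustrative reading rather than the paper’s exact formulation.

```python
import numpy as np

def semantic_bases(W: np.ndarray) -> np.ndarray:
    """Vocabulary-defined semantic bases from the LM head.
    W: (|V|, d) matrix mapping hidden states to logits. Returns a
    (|V|, d) matrix whose i-th row is the hidden-space basis r_i
    obtained from the pseudoinverse W^+."""
    return np.linalg.pinv(W).T  # (d, |V|) -> (|V|, d)

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between a vector a (d,) and each row of b (n, d)."""
    a = a / (np.linalg.norm(a) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return b @ a

def assign_cluster(h: np.ndarray, bases: np.ndarray) -> int:
    """Cluster an example's final representation h by its nearest
    vocabulary-defined semantic basis."""
    return int(np.argmax(cosine(h, bases)))

def select_demonstrations(query_h, example_hs, bases, k=4):
    """Pick in-context demonstrations whose latent cluster matches the
    query's, then rank within the cluster by cosine similarity."""
    q = assign_cluster(query_h, bases)
    candidates = [i for i, h in enumerate(example_hs)
                  if assign_cluster(h, bases) == q] or list(range(len(example_hs)))
    sims = [float(cosine(query_h, example_hs[i][None, :])[0]) for i in candidates]
    order = np.argsort(sims)[::-1]
    return [candidates[j] for j in order[:k]]
```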

2.4. Transformer Approximation Perspective

A theoretical lens (Ma et al., 9 Nov 2025) formulates VICL in a function-approximation framework: given a vocabulary $V = V_x \times V_y$, context $Z = [z^{(1)}, \dots, z^{(n)}, z]$, and Transformer weights held fixed, can prompt tokens be chosen so the Transformer maps queries to arbitrary targets? The central findings:

  • Without positional encoding, the class of VICL-transformer functions can be shown (via reductions to FNNs with finite weights) to lack universal approximation.
  • Introducing absolute positional encoding satisfying a density condition ($\mathcal{P}_x^{(j)} = j \cdot \alpha$ with irrational coordinates) enables approximation of arbitrary continuous mappings (illustrated numerically below).
  • Universal approximation (UAP) in VICL thus depends crucially on positional encoding and the structure of the vocabulary.
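A short numeric illustration of the density condition (not the proof itself): with $\mathcal{P}_x^{(j)} = j \cdot \alpha$ and irrational $\alpha$, the fractional parts equidistribute over $[0, 1)$, so positions can be chosen arbitrarily close to any target phase. The choice $\alpha = \sqrt{2}$ and the test interval below are arbitrary assumptions for the demo.

```python
import numpy as np

# Fractional parts of j * alpha for irrational alpha equidistribute over
# [0, 1) (Weyl's theorem), which is what the density condition relies on.
alpha = np.sqrt(2)
positions = (np.arange(1, 10_001) * alpha) % 1.0

# The fraction of positions landing in any small interval approaches the
# interval's length.
lo, hi = 0.30, 0.35
print(np.mean((positions >= lo) & (positions < hi)))  # ~= 0.05
```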

3. Benchmarking and Quantitative Results

3.1. WINODICT Results

  • Significant accuracy gap ($\geq 18$ points) between original Winograd and synthetic VICL tasks.
  • PaLM-540B, 5-shot: 95.6 ± 1.3% (Winograd), 68.5 ± 1.9% (WINODICT) — drop of 27.1 points.
  • Small models perform near random (50–55%) on synthetic tasks even with 5-shot demonstrations.
  • Best results occur when synonyms are provided as a suffix rather than definitions as a prefix.
  • Human zero-shot baseline on WINODICT: ~83.3%.

3.2. Broccoli User Study

  • Short-term MC retention: Broccoli 55% ± 4pp vs. table-control 37–42% ± 3–4pp.
  • Long-term MC retention (1–4 weeks): all conditions converge to ~30%; table and Broccoli indistinguishable.
  • Broccoli reduces explicit mnemonic strategy use by over 10x compared to tables (6–9% vs. 63–72%).
  • Fill-the-gap performance is also improved in the short term.
  • Reading speed slows by ~18% when Broccoli switching is active.

3.3. Latent Clustering Performance

  • VICL (semantic calibration, SC) achieves +3%–49% absolute accuracy improvements over standard ICL, and +5%–20% over kNN Prompting.
  • GPU time is nearly halved relative to LoRA/IA³ full PEFT.
  • Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) clustering metrics rise from near zero (fully entangled) to 0.6–0.9 after VICL-based clustering, confirming formation of vocabulary-centric latent clusters.
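For reference, ARI and AMI are standard clustering-agreement metrics; a toy computation with scikit-learn is shown below (the labels are synthetic placeholders, not values from the paper).

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Hypothetical cluster assignments (e.g., nearest vocabulary basis per
# example) versus gold class labels.
gold_labels     = [0, 0, 1, 1, 2, 2, 2, 0]
latent_clusters = [0, 0, 1, 1, 2, 2, 1, 0]

print("ARI:", adjusted_rand_score(gold_labels, latent_clusters))
print("AMI:", adjusted_mutual_info_score(gold_labels, latent_clusters))
```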

3.4. Approximation Capacity via Positional Encoding

  • Without suitable positional encoding, even rich prompt design does not achieve UAP in single-layer Transformers.
  • With absolute positional encoding ensuring dense coverage, context selection can “program” arbitrary continuous functions, implying, in principle, robust VICL under these architectural conditions. Relative or rotary positional encodings do not, in current proofs, support this property.

4. Theoretical and Practical Implications

VICL benchmarks such as WINODICT expose a fundamental limitation of current LLMs: parameter-free, context-only word learning remains shallow, especially under the pressure of synthetic or out-of-distribution tokens. This highlights a facet of diachronic degradation, i.e., LLMs’ inability to reflect lexical change post-training.

From a modeling perspective, successful VICL frameworks exploit:

  • Strategic prompt engineering (definition location and form),
  • Context-aware demonstration selection (latent clustering),
  • Memory models aligned with natural text recurrence (spaced repetition),
  • Appropriate architectural choices (notably, positional encoding in Transformers).

VICL thus offers a diagnostic for both LLM robustness to language change and the stability/efficiency of in-context learning.

5. Limitations and Open Questions

Identified constraints include:

  • Even with definitions, most LLMs fail to fully integrate new word meanings for tasks like coreference resolution.
  • Absolute positional encoding unlocks UAP in theory but may require impractically long context lengths in a single Transformer layer for rich approximations; multi-layer extensions remain under-explored.
  • VICL clustering as in (Gu et al., 29 Jan 2024) currently calibrates only the final Transformer layer; extending semantic axes deeper into the model and to generative tasks remains an open direction.
  • Computing the pseudoinverse W+W^+ for very large vocabularies may be computationally expensive, although only required once.

A plausible implication is that further architectural changes (e.g., hierarchical memory, multi-context representations) or continual learning techniques may be necessary to fully bridge the gap between human and artificial vocabulary learning.

6. Cross-Field Connections and Future Directions

VICL research connects and contrasts with:

  • Meta-learning and few-shot learning, wherein models adapt to new concepts or distributions primarily via adaptation of intermediate representations or rapid fine-tuning.
  • Memory-augmented networks and explicit retrieval-augmented models that maintain dynamic lexical memories or caches outside of parameterized weights.
  • Cognitive models of word acquisition, particularly those based on incidental/contextual learning and the power of spaced repetition in natural environments.

Future research directions highlighted in the literature include: extending latent semantic calibration to intermediate LM layers, joint training of clustering and head modules, construction of synthetic benchmarks to test the VICL-UAP boundary, and application of VICL mechanisms to generative modeling and translation (e.g., grouping of subword tokens). There is increasing interest in quantifying when and how architectural modifications—such as positional encoding—translate to measurable gains in vocabulary flexibility and semantic stability.
