Online Vocabulary Update Algorithm

Updated 31 January 2026
  • An online vocabulary update algorithm continuously refines word, subword, or feature representations to adapt to dynamic language environments.
  • It leverages incremental statistical updates and probabilistic frameworks to handle nonstationarity and domain drift in streaming data.
  • Empirical results demonstrate enhanced out-of-vocabulary recognition and increased inference speed across NLP, input methods, and federated applications.

An online vocabulary update algorithm incrementally refines, expands, or adapts the set of word, subword, or categorical feature representations used by a model as new data arrives in a stream or dynamic environment. Unlike batch-based learning, which processes large static corpora to produce fixed vocabularies, online methods address nonstationarity, domain drift, resource constraints, and interactive use cases where new words or categories appear continuously or input distributions shift. This paradigm spans applications from domain adaptation in NLP to input methods, federated OOV discovery, categorical feature encoding, and dynamic tokenizer extension for LLMs, and encompasses a diversity of mathematical and algorithmic frameworks.

1. Mathematical Frameworks and Update Objectives

Online vocabulary update algorithms formalize vocabulary refinement as a streaming or sequential process under various learning or optimization objectives. Broad settings include:

  • Incremental unsupervised domain adaptation: Updating word-context co-occurrence statistics without retraining the supervised classifier, so representations for words gradually drift from source toward target distribution as new sentences arrive (Yin et al., 2016).
  • Dynamic word likelihood estimation: Maintaining a continuous word-likelihood score (IME Word Likelihood, IWL) for each character sequence, which directly informs the language-model probabilities used in input decoding (Zhang et al., 2017).
  • Probabilistic embedding inference: Employing Bayesian online inference to update hash-based categorical feature embeddings in bounded memory while tracking parameter uncertainty (Li et al., 25 Nov 2025).
  • Tokenization expansion through statistics-driven selection: Dynamically expanding the subword vocabulary of an LLM using fragment scores and BPE statistics to minimize over-fragmentation on streaming domain data (Hong et al., 2021).
  • Privacy-preserving OOV discovery: Secure aggregation and LDP-based randomization in federated settings for online mining of new words without leaking identifiable user input (Sun et al., 2024).
  • Transformation-based composition: Reshaping vocabulary to maximize coverage via composition rules applied to base forms and transformation offsets, freeing slots for OOVs and reducing redundancy (Reif et al., 19 Oct 2025).

Many algorithms operate at the level of word-, subword-, or category-specific tables, e.g., freq_L[w][i], IWL(w), posterior parameters λ_{b,j}, or embedding matrices, and update these in response to immediate or recent data rather than making global updates.
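As an illustration of such local table updates, a minimal Python sketch (the names freq_left, iwl, and update_tables are hypothetical, and the increments stand in for the papers' specific rules):

```python
from collections import Counter, defaultdict

# Hypothetical sketch of local, per-word table updates: freq_left[w] counts
# words appearing immediately left of w; iwl[w] is a running additive
# word-likelihood score. Only entries touched by new data are modified.

freq_left = defaultdict(Counter)
iwl = defaultdict(float)

def update_tables(sentence, alpha=1.0):
    """Update only the table entries touched by this sentence."""
    for i, w in enumerate(sentence):
        if i > 0:
            freq_left[w][sentence[i - 1]] += 1  # incremental co-occurrence count
        iwl[w] += alpha  # additive likelihood increment for each observation

update_tables(["the", "cat", "sat"])
update_tables(["the", "cat", "ran"])
```

After these two sentences, only the entries for the observed words have changed; no global pass over the vocabulary occurs.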

2. Algorithmic Strategies and Pseudocode

Representative online vocabulary update routines instantiate the following canonical steps:

| Application | Step 1: Data Handling | Step 2: Update Rule | Step 3: Culling or Selection |
| --- | --- | --- | --- |
| DA for POS tagging (Yin et al., 2016) | Sentence stream; left/right context | Increment context counts; update features | N/A; all words accumulate stats |
| IME wordhood (Zhang et al., 2017) | User-confirmed input sequence | Add substrings; segment; IWL increments | Periodic pruning by IWL/capacity |
| Hash embedding (Li et al., 25 Nov 2025) | Streaming categorical features | Bayesian VI update for hashed buckets | Fixed parameter table; no culling |
| Domain tokenizer (Herold et al., 30 Sep 2025) | Batch or incremental domain corpus | Merge appending to BPE; embedding averaging | Budget-limited selection |
| Federated OOV (Sun et al., 2024) | User-device OOV prefixes | LDP randomization; secure aggregation | Top-k by noisy count per prefix/round |
| Vocab reshaping (Reif et al., 19 Oct 2025) | Ongoing corpus statistics | Surface-form removal; base+transform insertion | Lowest-frequency forms removed |

This diversity reflects how the nature of the data stream, the type of vocabulary (lexicon, subword, categorical), and resource or privacy constraints shape each design; all share the fundamental property of incremental, stateful evolution, with update rules local to each step.
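The three canonical steps can be sketched as a generic stream loop; the class name and its capacity-based culling rule are illustrative, not any single paper's method:

```python
import heapq
from collections import defaultdict

# Illustrative skeleton of the canonical steps in the table above:
# (1) handle a unit of streaming data, (2) apply a local update rule,
# (3) optionally cull to a capacity cap. Names are hypothetical.

class OnlineVocabulary:
    def __init__(self, capacity=10_000):
        self.scores = defaultdict(float)  # per-entry utility (e.g., IWL, frequency)
        self.capacity = capacity

    def observe(self, tokens, increment=1.0):
        # Steps 1-2: local, stateful update for each observed token.
        for t in tokens:
            self.scores[t] += increment
        # Step 3: capacity-based culling keeps the highest-scoring entries.
        if len(self.scores) > self.capacity:
            keep = heapq.nlargest(self.capacity, self.scores.items(),
                                  key=lambda kv: kv[1])
            self.scores = defaultdict(float, keep)

vocab = OnlineVocabulary(capacity=2)
vocab.observe(["a", "a", "b"])
vocab.observe(["c", "c", "c"])  # exceeds capacity; lowest-scoring entry dropped
```

Swapping the update rule (Bayesian posterior step, BPE merge selection, noisy count aggregation) and the culling criterion recovers the variants in the table.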

3. Theoretical Properties: Memory, Convergence, and Robustness

  • Memory efficiency is a recurring constraint: probabilistic hash embedding maintains a bounded B × d × 2 parameter table irrespective of vocabulary size (Li et al., 25 Nov 2025), IME vocabularies are capped and pruned (Zhang et al., 2017), and federated OOV discovery enforces strict per-client and per-layer capacity limits (Sun et al., 2024). Categorical streaming settings preclude per-item tables, favoring compact shared representations.
  • Convergence: Domain adaptation by online counting converges to batch statistics over the full test corpus; empirically, accuracy matches batch DA within a few dozen occurrences per word (Yin et al., 2016). Online IME adaptation stabilizes top-1 accuracy within a few thousand MIUs and shows diminishing fluctuations (Zhang et al., 2017). Bayesian hash embedding is formally invariant to item arrival order in exact updates (Li et al., 25 Nov 2025), and periodic vocabulary expansion maintains stable fragment scores (Hong et al., 2021).
  • Robustness to order and drift: Probabilistic methods guarantee permutation invariance of the posterior, mitigating catastrophic forgetting and arrival-order sensitivity (Li et al., 25 Nov 2025). Online segmentation-based wordhood treats unknown words no differently from known, promoting resilient adaptation to shifting user input (Zhang et al., 2017).
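The order-invariance property can be seen concretely with exact additive updates, where accumulated sufficient statistics commute across any stream ordering (a simplified stand-in for the exact Bayesian posterior updates):

```python
import random
from collections import Counter

# Sketch illustrating permutation invariance of exact additive updates:
# each step only adds to sufficient statistics, so any reordering of the
# stream yields the same final state. This is the commutativity that
# probabilistic methods exploit to avoid arrival-order sensitivity.

def run_stream(items):
    counts = Counter()
    for it in items:
        counts[it] += 1  # exact additive update (commutative)
    return counts

stream = ["w1", "w2", "w1", "w3"] * 5
shuffled = stream[:]
random.shuffle(shuffled)
assert run_stream(stream) == run_stream(shuffled)  # same posterior statistics
```

Approximate or lossy updates (e.g., aggressive pruning mid-stream) break this commutativity, which is why capped vocabularies can remain order-sensitive.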

4. Application Domains and Empirical Performance

  • Domain-adaptive NLP: Online DA for POS tagging delivers accuracy within 0.03% of batch DA, with up to 6% better OOV tagging than static representations (Yin et al., 2016).
  • IME input likelihood: Online IWL adaptation for Chinese IME achieves top-1 scores of 55.3% on People's Daily and 51.4% on the Touchpal domain, far surpassing a static trigram model (Zhang et al., 2017).
  • LLM tokenizer customization: Incrementally extending tokenizers reduces domain fertility by up to 20% (shorter sequences), with a net 8–30% increase in inference throughput and negligible impact on NLU accuracy (Herold et al., 30 Sep 2025).
  • Vocabulary reshaping: Removing up to 10% of surface forms and recomposing OOVs expands coverage to tens of thousands of word forms, with a performance differential within 1–3% on benchmarks (Reif et al., 19 Oct 2025).
  • Federated OOV mining: Gboard's privacy-preserving pipeline provides 92.1% coverage of rare OOVs after two passes, under (ε′, δ) = (0.315, 10⁻¹⁰) central DP (Sun et al., 2024).

These results consistently demonstrate that online vocabulary updates, via a range of mechanisms, match or outperform batch and static approaches in coverage and adaptation, especially for uncommon or new terms.

5. Practical Hyperparameters and Engineering Considerations

  • Update batch size and frequency: IME IWL updates use α = 1.0, β = 5.0, γ = 1.0, with periodic pruning every per steps up to the capacity cap (Zhang et al., 2017). Online BPE expansion via AVocaDo controls the initial and incremental merge counts α and β and a fragment-score threshold γ (Hong et al., 2021). LLM domain vocabulary extension sets a merge budget N, typically in the thousands to several tens of thousands (Herold et al., 30 Sep 2025).
  • Initialization: New token embedding rows are initialized by weighted averages of splits (e.g., e_t = (1/2)(e_{t1} + e_{t2})) (Herold et al., 30 Sep 2025) or by averages over constituent subwords (Hong et al., 2021); transformation offsets for compositional vocabularies are induced from the difference between surface and base embeddings (Reif et al., 19 Oct 2025).
  • Pruning/culling: IME and federated methods prune by minimal IWL or by top noisy count; compositional vocabulary reshaping drops the lowest-utility (lowest-frequency) forms (Reif et al., 19 Oct 2025).
  • Privacy/DP mechanics: Federated OOV applies ε-LDP per user-layer, secure aggregation on the server, and central (ε′, δ)-DP on the final release (Sun et al., 2024); communication and computation are strictly bounded.
  • Resource adaptation: PHE recommends B ≈ 5–10× the expected vocabulary size and K = 2–4 hash functions (Li et al., 25 Nov 2025).
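A minimal sketch of the averaging initialization above, using plain Python lists to stand in for embedding rows (the function name is hypothetical, not a specific library's API):

```python
# Sketch of new-token embedding initialization by averaging the embedding
# rows of the subwords the token previously split into, matching
# e_t = (1/2)(e_{t1} + e_{t2}) for a two-way split and generalizing to
# the average over all constituent subwords.

def init_new_token_embedding(subword_embeddings):
    """Element-wise average of the constituent subword embedding rows."""
    n = len(subword_embeddings)
    dim = len(subword_embeddings[0])
    return [sum(e[i] for e in subword_embeddings) / n for i in range(dim)]

# A new token that previously tokenized as two subwords:
e_t1, e_t2 = [0.25, 0.5], [0.75, 0.0]
e_t = init_new_token_embedding([e_t1, e_t2])  # → [0.5, 0.25]
```

The average keeps the new row inside the convex hull of existing embeddings, so the model's initial behavior on the new token approximates its behavior on the old split.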

6. Limitations, Challenges, and Future Directions

  • Retraining burden: Most algorithms avoid classifier retraining, updating only representation or likelihood tables (Yin et al., 2016, Zhang et al., 2017, Hong et al., 2021), making them suitable for real-time or low-latency scenarios.
  • Backward compatibility: Frequent vocabulary changes risk disrupting downstream services (e.g., position embeddings, old-to-new token mapping); stability of the mapping and model infrastructure is necessary (Hong et al., 2021, Herold et al., 30 Sep 2025).
  • Culling trade-offs: Aggressive capacity-based pruning may remove rare but valuable terms; criteria are often frequency-based to prioritize utility.
  • Complexity management: Model architectures must accommodate embedding-table growth or compositional inference logic, ideally minimizing code changes and runtime impact (Reif et al., 19 Oct 2025).
  • Privacy constraints: In consumer-facing deployments, privacy guarantees via LDP and DP aggregation raise communication and protocol complexity (Sun et al., 2024).

A plausible implication is that ongoing work will further integrate online vocabulary updates with robust continual learning, richer compositionality, and strict privacy constraints, enabling adaptive, scalable, and fair NLP and ML systems in live environments.

7. Comparative Summary Table

| Paper/Algorithm | Vocabulary Type | Update Mechanism | Memory Bound | Empirical Gains |
| --- | --- | --- | --- | --- |
| FLORS Online DA (Yin et al., 2016) | Lexical/word | Context bigram count increment | O(\|V\|) | <0.03% from batch; +6% OOV accuracy |
| OMWA IME (Zhang et al., 2017) | Word (Chinese IME) | Additive IWL/segmentation | Capacity-capped | +30 pts top-1 over static trigram |
| PHE hash (Li et al., 25 Nov 2025) | Categorical features | Bayesian VI, per-hash updates | O(Bd) | Outperforms deterministic; permutation-invariant |
| AVocaDo (Hong et al., 2021) | Subword (BPE) | Fragment-score-driven BPE merging | Variable | +1–13 F1 pts vs. base BERT/SciBERT |
| Domain Tokenizer (Herold et al., 30 Sep 2025) | Subword (LLM) | Frequency-sorted merge appending | Budgeted | 20% shorter sequences; 30% speedup |
| Federated OOV (Sun et al., 2024) | Lexical/word (OOV) | LDP randomization, secure aggregation | Trie-layer cap | 92.1% coverage; (0.315, 10⁻¹⁰) DP |
| Vocab Diet (Reif et al., 19 Oct 2025) | Surface/compositional | Morphological decomposition, offsets | −10% slots | 70–95% new forms composable; <3% perf drop |

This cross-section delineates both the technical strategies and measurable advances, confirming the centrality of online vocabulary update algorithms in modern adaptive NLP and ML.
