AVoCaDO SFT: Adaptive Vocabulary Fine-Tuning
- The paper demonstrates that treating the vocabulary as an optimizable parameter improves tokenization and downstream performance, with absolute F₁ gains ranging from +1.69 to +13.01 points across four domains.
- AVoCaDo SFT employs a fragment score to iteratively select domain-specific tokens, reducing over-segmentation and preserving semantic coherence.
- Contrastive regularization is integrated to maintain alignment between new token embeddings and the original pretrained representations, ensuring generalization in low-resource settings.
AVoCaDO SFT refers to two distinct but identically named supervised fine-tuning (SFT) techniques, one from natural language processing and one from audiovisual captioning. This entry focuses on the original AVoCaDo SFT, an approach for adapting the vocabulary of pretrained language models to downstream domains, introduced in "AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain" (Hong et al., 2021). The method reconceptualizes the vocabulary itself as an optimizable parameter, enabling targeted augmentation and robust adaptation via both tokenization statistics and contrastive regularization. AVoCaDo SFT is notable for its strong empirical performance, data efficiency, and generality across diverse application domains.
1. Vocabulary as a Fine-Tuning Parameter
AVoCaDo SFT fundamentally departs from standard transfer learning paradigms by treating the token vocabulary as a mutable parameter during fine-tuning. Instead of leaving the pretrained vocabulary $\mathcal{V}_{pre}$ fixed, which is suboptimal when the downstream data distribution diverges from that of pretraining, the AVoCaDo algorithm expands $\mathcal{V}_{pre}$ with selected domain-specific lexemes from the downstream corpus $\mathcal{D}$, forming the adapted vocabulary $\mathcal{V}_{ada}$. The selection process is guided by a data-driven metric (the fragment score) to optimize tokenization granularity for domain-relevant terms. By augmenting $\mathcal{V}_{pre}$ only with tokens that reduce over-segmentation (without indiscriminately increasing vocabulary size), AVoCaDo creates holistic tokenizations that preserve the semantic integrity of frequent domain-specific words.
This adaptive vocabulary optimization is executed iteratively: candidate tokens from $\mathcal{D}$ are added in decreasing frequency order, prioritizing subwords poorly represented by $\mathcal{V}_{pre}$. The process continues until a strict tokenization constraint, measured by the fragment score, is satisfied. This approach allows the method to optimize both the content and the size of the extended vocabulary based on empirical downstream corpus statistics.
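To make the mechanics concrete, the following is a minimal sketch of the augmentation step, assuming a HuggingFace-style tokenizer and model; the `domain_tokens` list stands in for the fragment-score-selected tokens of Section 2 and is purely illustrative.

```python
# Sketch only: extend the pretrained vocabulary V_pre with selected domain
# tokens and grow the embedding matrix to match (HuggingFace `transformers`).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical stand-in for the fragment-score selection: frequent domain
# words that the pretrained vocabulary would otherwise over-segment.
domain_tokens = ["acetylcholinesterase", "phosphorylation", "chemoreceptor"]

num_added = tokenizer.add_tokens(domain_tokens)   # V_pre -> V_ada
model.resize_token_embeddings(len(tokenizer))     # new rows are freshly initialized

print(f"Added {num_added} tokens.")
print(tokenizer.tokenize("acetylcholinesterase phosphorylation"))
# The added terms now survive tokenization as single units instead of
# being split into many subword fragments.
```

The freshly initialized embedding rows created here are exactly the parameters that the contrastive regularization of Section 3 constrains.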
2. Tokenization Fragment Score and Vocabulary Selection
The central technical mechanism in AVoCaDo SFT is the fragment score,

$$F(\mathcal{V}, \mathcal{D}) = \frac{\text{number of subwords obtained by tokenizing } \mathcal{D} \text{ with } \mathcal{V}}{\text{number of words in } \mathcal{D}},$$

which quantifies average subword fragmentation in the target corpus $\mathcal{D}$. A high $F(\mathcal{V}, \mathcal{D})$ implies excessive splitting of domain terms, associated with semantic dilution and degraded performance on downstream tasks. The optimization operates as a greedy search: at each iteration, candidate tokens from $\mathcal{D}$ are incorporated into the vocabulary in batches of size $k$ (after an initial addition of size $k_0$), and $F(\mathcal{V}, \mathcal{D})$ is monitored. The process halts when $F(\mathcal{V}, \mathcal{D})$ drops below the hyperparameter threshold $\delta$. This iterative design ensures that only tokens essential for holistic tokenization are included, limiting vocabulary bloat and mitigating the risk of overfitting to idiosyncratic downstream artifacts.
Algorithmic control via the hyperparameters ($\delta$, $k_0$, $k$) allows a balance between adaptation granularity, efficiency, and robustness. Practical application reveals that this fragment-score-driven token selection is critical for capturing domain-specific morphological and syntactic constructions that would otherwise be fragmented and poorly represented by $\mathcal{V}_{pre}$.
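A compact sketch of the fragment score and the greedy selection loop is given below, under two assumptions not spelled out here: the score is computed as the average number of subwords per whitespace-delimited word, and `candidates` is pre-sorted by decreasing corpus frequency. Names are illustrative, not from the paper's released code.

```python
def fragment_score(tokenizer, corpus):
    """Average number of subwords per word in `corpus`; higher values
    indicate heavier over-segmentation of domain terms."""
    n_words, n_subwords = 0, 0
    for sentence in corpus:
        for word in sentence.split():
            n_words += 1
            n_subwords += len(tokenizer.tokenize(word))
    return n_subwords / max(n_words, 1)

def expand_vocabulary(tokenizer, corpus, candidates, delta, k0, k):
    """Greedy expansion: add candidates (sorted by decreasing frequency)
    in batches until the fragment score falls below the threshold delta."""
    tokenizer.add_tokens(candidates[:k0])              # initial addition, size k0
    idx = k0
    while idx < len(candidates) and fragment_score(tokenizer, corpus) > delta:
        tokenizer.add_tokens(candidates[idx:idx + k])  # one batch of size k
        idx += k
    return tokenizer
```

Because candidates are consumed in frequency order, each batch removes the most impactful sources of fragmentation first, and the threshold $\delta$ bounds vocabulary growth.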
3. Contrastive Regularization to Preserve Embedding Robustness
Augmenting the vocabulary introduces new embeddings that must be initialized (e.g., randomly or via interpolation), but downstream datasets are typically too small to optimize these embeddings robustly without overfitting. To address this, AVoCaDo SFT employs a contrastive regularization loss that couples the representations induced by the original and the augmented vocabularies. Each input sentence is tokenized with both $\mathcal{V}_{pre}$ and $\mathcal{V}_{ada}$; its representations at a chosen layer, $h_i^{pre}$ and $h_i^{ada}$, form positive pairs, while the other sentences in the batch serve as negatives for contrastive learning:

$$\mathcal{L}_{\mathrm{cont}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(h_i^{pre}, h_i^{ada})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i^{pre}, h_j^{ada})/\tau\right)}$$

Here, $N$ denotes the batch size, $\tau$ is the temperature parameter, and $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity. The total loss becomes

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\,\mathcal{L}_{\mathrm{cont}},$$

with $\lambda$ (typically $1.0$) controlling the regularization strength. This design keeps representations for new tokens close to the pretrained manifold, preventing catastrophic divergence and preserving generalization, particularly when downstream data is scarce.
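A sketch of the contrastive term in PyTorch follows, assuming the $(N, d)$ sentence representations under both tokenizations have already been extracted from the chosen layer; the temperature default and all names are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_pre, h_ada, tau=0.1):
    """In-batch contrastive regularizer matching the equation above: row i of
    h_pre (original tokenization) pairs positively with row i of h_ada
    (adapted tokenization); all other rows in the batch act as negatives."""
    h_pre = F.normalize(h_pre, dim=-1)   # unit norm, so dot product = cosine sim
    h_ada = F.normalize(h_ada, dim=-1)
    logits = h_pre @ h_ada.T / tau       # (N, N) similarity matrix
    targets = torch.arange(h_pre.size(0), device=h_pre.device)
    return F.cross_entropy(logits, targets)  # positives lie on the diagonal

# Total objective (lam plays the role of lambda, typically 1.0):
# loss = task_loss + lam * contrastive_loss(h_pre, h_ada)
```

Note that `F.cross_entropy` over the similarity matrix with diagonal targets computes exactly the averaged $-\log$ softmax term of $\mathcal{L}_{\mathrm{cont}}$ above.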
4. Empirical Performance and Cross-Domain Generality
AVoCaDo SFT systematically improves downstream performance across several domains when compared with baseline fine-tuning that retains a static vocabulary. For instance:
| Domain | Benchmark | Baseline F₁ | AVoCaDo F₁ | Absolute Gain |
|---|---|---|---|---|
| Biomedical | ChemProt | 79.38 | 81.07 | +1.69 |
| Computer Science | ACL-ARC | 56.82 | 67.28 | +10.46 |
| News | HyperPartisan | 84.51 | 89.31 | +4.80 |
| Reviews | Amazon | 55.50 | 68.51 | +13.01 |
These improvements are realized on datasets typically below 5,000 instances, indicating the data efficiency of the approach. Similar relative improvements are observed with various pretrained models (e.g., BERT_base, SciBERT, BioBERT), and the method’s application does not require extensive domain-specific corpora. This highlights the viability of AVoCaDo SFT for low-resource and high-specialization scenarios.
5. Comparison to Other Vocabulary Adaptation Techniques
AVoCaDo SFT departs from prior domain adaptation methods such as those employed in SciBERT and BioBERT, which rely on large-scale adaptive pretraining on domain-specific corpora. Unlike exBERT, which also expands vocabulary but requires supplementary corpora for embedding learning, AVoCaDo SFT utilizes only the available downstream data. It further distinguishes itself by automating token selection via the fragment score (rather than manual curation), tying vocabulary growth directly to measurable tokenization quality. The integrated contrastive regularization is distinctive in constraining new token embeddings within the pretrained representational space.
The design enables resource-efficient specialization without additional data collection or computationally intensive pretraining phases.
6. Applications, Limitations, and Future Directions
The AVoCaDo SFT technique is applicable wherever distributional mismatch between pretraining and downstream data compromises tokenization (and consequently, model accuracy). These situations include domain-specialized classification, entity recognition, question answering, and other tasks in fields such as biomedical parsing, technical text mining, news analytics, and customer review interpretation.
Limitations include sensitivity to the fragment score threshold $\delta$ and the step sizes ($k_0$, $k$). Overly aggressive token addition may dilute vocabulary efficiency, while insufficient expansion may fail to remedy fragmentation. The regularization design presupposes a degree of alignment between pretrained and downstream domains; extreme out-of-domain adaptation may require further developments.
Promising research avenues include extension to multilingual or cross-modal settings, development of more sophisticated subword selection criteria, refinement of regularization methods (potentially leveraging other forms of representation alignment), and application to generative pretraining architectures.
In summary, AVoCaDo SFT articulates a new paradigm for domain adaptation in token-based LLMs by (i) treating the vocabulary as an optimizable parameter, (ii) optimizing tokenization quality via data-driven statistics, and (iii) regularizing new embeddings to maintain generalization. The method robustly and efficiently improves downstream performance without the need for large external corpora or high-cost retraining.