
Adapting Language Models via Token Translation (2411.00593v2)

Published 1 Nov 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Modern LLMs use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English LLMs, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.

Adapting LLMs via Token Translation

The paper "Adapting Language Models via Token Translation" introduces Sparse Sinkhorn Token Translation (S2T2), a method for improving the performance of LLMs applied to new target domains. The central problem is that an LLM trained with a fixed tokenizer degrades when deployed on data that differs from its training distribution: the source tokenizer compresses the new domain's text poorly, which inflates inference cost and weakens semantic alignment.
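To make the compression gap concrete, the toy check below (an assumed setup, not a measurement from the paper) tokenizes a protein string with a general-purpose English BPE tokenizer and counts how many tokens it spends per byte.

```python
# Illustration of the compression problem (assumed setup, not the paper's
# exact measurement): an English BPE tokenizer barely compresses protein
# text, often spending close to one token per amino-acid character.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in English tokenizer
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
ids = tok.encode(seq)
print(len(seq), "bytes ->", len(ids), "tokens")       # roughly 1 token per 1-2 bytes
```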

The S2T2 approach trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained model's next-source-token predictor. Because the translation is learned without parallel data, S2T2 sidesteps a key obstacle for LLMs processing domain-specific text such as protein sequences.
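As a purely illustrative example of the first step, a target-domain tokenizer can be trained directly on raw protein text with standard BPE; the vocabulary size, special tokens, and two-sequence corpus below are placeholders rather than the paper's configuration.

```python
# Minimal sketch: train a BPE tokenizer on target-domain text (e.g., protein
# sequences). The corpus, vocabulary size, and special tokens are illustrative
# choices, not the settings used in the paper.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

protein_corpus = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "GSHMSLYDDLGVEALVIGAGPAGLSAAYELRK",
]  # stand-in for an iterator over UniRef50 sequences

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=4096,                       # hypothetical target vocabulary size
    special_tokens=["[UNK]", "[PAD]"],
)
tokenizer.train_from_iterator(protein_corpus, trainer)

print(tokenizer.encode(protein_corpus[0]).tokens)
```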

Key Methodological Contributions

The paper describes the Sparse Sinkhorn translation process, which inserts a weight-tied sparse optimal transport (OT) layer into both the token embedding layer and the LLM head. Working within a sparse OT framework, the algorithm learns a mapping between the two vocabularies while keeping the representations computationally efficient. The OT matrix is driven to sparsity through iterative projections and is computed without direct parameterization, reducing the computational burden and improving scalability.
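The paper's exact layer is more involved, but the core idea, Sinkhorn-style scaling that turns learned scores into a transport plan coupling target and source tokens, can be sketched as follows. This is a minimal illustration assuming uniform marginals and entropic regularization; the sparsification step and the precise weight tying of S2T2 are not reproduced here.

```python
# Minimal sketch of Sinkhorn scaling: turn a learned score matrix between the
# target and source vocabularies into an (approximately) doubly-stochastic
# transport plan P, then build target-token embeddings as mixtures of the
# frozen, pre-trained source-token embeddings.
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """Entropic OT via log-domain Sinkhorn scaling with uniform marginals."""
    n, m = scores.shape
    log_r = -torch.log(torch.tensor(float(n)))   # each target token carries mass 1/n
    log_c = -torch.log(torch.tensor(float(m)))   # each source token carries mass 1/m
    log_p = scores / eps
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) + log_r  # fit row marginals
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_c  # fit column marginals
    return log_p.exp()

target_vocab, source_vocab, dim = 512, 4096, 256                      # illustrative sizes
E_source = torch.randn(source_vocab, dim)                             # frozen source embeddings
scores = torch.nn.Parameter(torch.randn(target_vocab, source_vocab))  # learned translation logits

P = sinkhorn(scores)                              # transport plan over token pairs
weights = P / P.sum(dim=1, keepdim=True)          # rows as convex weights per target token
E_target = weights @ E_source                     # target embeddings from source embeddings
# Tying the same plan at the output head maps next-source-token logits back to target tokens.
```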

Experimental Evaluation

Experiments conducted using English LLMs on the UniRef50 protein sequence dataset provide strong evidence for the efficacy of S2T2. The results indicate that S2T2 achieves superior performance in perplexity and bits-per-byte (BpB) metrics compared to baseline methods, which include unconstrained and dense translation models. Specifically, S2T2 provides an effective initialization for continual fine-tuning, enhancing LLM quality and compression.
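For readers less familiar with the metrics, bits per byte rescales the model's per-token cross-entropy by the tokenizer's compression rate, so a better target tokenizer can lower BpB even when perplexity is unchanged. The conversion below uses standard definitions and made-up numbers, not results from the paper.

```python
import math

def bits_per_byte(token_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert average per-token negative log-likelihood (nats) to bits per byte."""
    total_bits = token_nll_nats * n_tokens / math.log(2)   # nats -> bits over the corpus
    return total_bits / n_bytes

# Hypothetical numbers for illustration only.
nll = 2.1                                  # average per-token NLL in nats
perplexity = math.exp(nll)
bpb = bits_per_byte(nll, n_tokens=250_000, n_bytes=1_000_000)
print(f"perplexity={perplexity:.2f}  bits/byte={bpb:.3f}")
```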

Remarkably, the S2T2 approach not only outperforms simple fine-tuning with either the original or new tokenizers but also demonstrates successful parameter transferability between models of different sizes. Notably, a translator learned on a smaller model (OLMo-1B) was effectively applied to a larger model (OLMo-7B), underscoring the potential for scalable model refinement without the need for extensive retraining.
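One plausible reading of why this transfer works, stated here as an assumption rather than a claim from the paper's text: the learned plan couples token identities rather than hidden dimensions, so as long as both models share the source vocabulary, the same plan can be applied to a larger model's embedding table. The shapes below are illustrative.

```python
# Sketch: the plan P learned alongside a small model is a vocab-by-vocab
# object, so it can be re-applied to a larger model's embedding table
# (assuming a shared source vocabulary). All shapes are illustrative.
import torch

P = torch.rand(512, 4096)                 # plan learned with the small model
P = P / P.sum(dim=1, keepdim=True)        # rows as convex weights over source tokens

E_small = torch.randn(4096, 2048)         # small model's source-token embeddings
E_large = torch.randn(4096, 4096)         # large model's source-token embeddings (wider)

E_target_small = P @ E_small              # what the translator was trained against
E_target_large = P @ E_large              # transferred for free: P itself is not re-learned
print(E_target_large.shape)               # torch.Size([512, 4096])
```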

Implications and Future Directions

The S2T2 method holds substantial implications for the development of more adaptable, multi-domain LLMs. Practically, this research suggests pathways to reduce inference costs and improve semantic coherence across diverse datasets, which is invaluable for applications in bioinformatics, code translation, and beyond. Theoretically, it contributes to the discourse on probabilistic and sparse methods in machine translation and domain adaptation, challenging conventional reliance on parallel corpora.

Future research could explore the integration of S2T2 with modalities beyond proteins, such as code or image data, further extending the utility of LLMs. Additionally, the exploration of unified multi-domain tokenizers that combine multiple domain vocabularies remains an intriguing, yet challenging, avenue for further investigation.

In summary, the approach provides a rigorous, scalable mechanism for expanding the applicability of LLMs, ensuring their utility in increasingly complex, diverse environments.

Authors (5)
  1. Zhili Feng (22 papers)
  2. Tanya Marwah (13 papers)
  3. Lester Mackey (79 papers)
  4. David Alvarez-Melis (48 papers)
  5. Nicolo Fusi (26 papers)