Adapting LLMs via Token Translation
The research paper "Adapting LLMs via Token Translation" introduces Sparse Sinkhorn Token Translation (S2T2), a method for improving the performance of LLMs applied to new target domains. The central issue is that an LLM trained with a fixed tokenizer degrades when deployed on data that differs from its training distribution: the original tokenizer compresses out-of-domain text poorly, which inflates sequence lengths and inference costs and weakens the alignment between tokens and the semantics of the new domain.
The S2T2 approach builds a tokenizer tailored to the target domain and then learns to translate between the source and target vocabularies without any parallel data, so the pre-trained model's predictive capabilities can be reused directly. This sidesteps the limitations LLMs face when processing domain-specific text, such as protein sequences.
Key Methodological Contributions
The paper details the Sparse Sinkhorn translation process, which injects a weight-tied sparse optimal transport (OT) layer into both the token embedding layer and the LM head. The sparse OT formulation learns a mapping between the two vocabularies while keeping the translation computationally efficient: the OT matrix is driven toward sparsity through iterative projections and is computed without direct parameterization, which reduces the computational burden and improves scalability.
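To make the mechanism concrete, the following is a minimal PyTorch sketch of a weight-tied translation layer in the spirit of S2T2, not the authors' implementation: a Sinkhorn-normalized plan over the source vocabulary re-expresses each target token as a sparse mixture of frozen source embeddings on the input side, and the same plan maps source-vocabulary logits to target-vocabulary logits at the head. The class name, the top-k sparsification, and the uniform marginals are illustrative assumptions rather than details taken from the paper.

```python
import math
import torch
import torch.nn as nn

def sinkhorn_plan(scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternately rescale rows and columns in log space so the score
    matrix approaches a transport plan with uniform marginals."""
    m, n = scores.shape
    log_a, log_b = -math.log(m), -math.log(n)
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) + log_a
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_b
    return log_p.exp()

class TokenTranslator(nn.Module):
    """Illustrative weight-tied translator: one plan P of shape
    (|V_tgt|, |V_src|) builds target-token embeddings as mixtures of
    frozen source embeddings, and the same P maps source-vocabulary
    logits to target-vocabulary logits at the LM head."""

    def __init__(self, src_embed: torch.Tensor, tgt_vocab_size: int, top_k: int = 8):
        super().__init__()
        self.register_buffer("src_embed", src_embed)         # frozen (|V_src|, d)
        self.scores = nn.Parameter(                          # only trainable weights
            0.01 * torch.randn(tgt_vocab_size, src_embed.size(0)))
        self.top_k = top_k

    def plan(self) -> torch.Tensor:
        p = sinkhorn_plan(self.scores)
        # Crude sparsification: keep the top-k source tokens per target token,
        # then renormalize each row into a convex mixture.
        top = torch.topk(p, self.top_k, dim=-1)
        sparse = torch.zeros_like(p).scatter(-1, top.indices, top.values)
        return sparse / sparse.sum(dim=-1, keepdim=True)

    def embed(self, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Input side: target-token embeddings as mixtures of source rows.
        return self.plan()[tgt_ids] @ self.src_embed

    def translate_logits(self, src_logits: torch.Tensor) -> torch.Tensor:
        # Output side (tied weights): equivalent to using P @ W_src as the new head.
        return src_logits @ self.plan().T
```

In this reading, only `scores` is trained while the backbone stays frozen, which is consistent with the paper's use of the translated model as an initialization for subsequent continual fine-tuning.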
Experimental Evaluation
Experiments conducted using English LLMs on the UniRef50 protein sequence dataset provide strong evidence for the efficacy of S2T2. The results indicate that S2T2 achieves superior performance in perplexity and bits-per-byte (BpB) metrics compared to baseline methods, which include unconstrained and dense translation models. Specifically, S2T2 provides an effective initialization for continual fine-tuning, enhancing LLM quality and compression.
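For readers comparing the two metrics: perplexity is normalized per token and therefore depends on the tokenizer, whereas bits-per-byte normalizes by the raw byte count of the text and is comparable across tokenizers with different compression rates. The helper below uses the standard definitions; the numbers in the usage line are made up for illustration.

```python
import math

def perplexity_and_bpb(total_nll_nats: float, n_tokens: int, n_bytes: int):
    """Standard LM metrics from a corpus-level negative log-likelihood (in nats).

    Perplexity is per token (tokenizer-dependent); bits-per-byte divides the
    loss in bits by the raw byte count, so it is tokenizer-agnostic."""
    ppl = math.exp(total_nll_nats / n_tokens)
    bpb = total_nll_nats / (math.log(2) * n_bytes)
    return ppl, bpb

# Illustrative numbers only: 1e6 nats of loss over 400k tokens and 2 MB of text.
print(perplexity_and_bpb(1_000_000, 400_000, 2_000_000))  # ~(12.18, 0.72)
```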
Remarkably, S2T2 not only outperforms straightforward fine-tuning with either the original or the new tokenizer, but the learned translators also transfer across model sizes. Notably, a translator trained against the smaller OLMo-1B model could be applied effectively to the larger OLMo-7B model, underscoring the potential for scalable model refinement without extensive retraining.
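One plausible reading of this transfer result is that the learned plan lives entirely in vocabulary space rather than in the hidden dimension, so it can be reapplied to a larger model's own embedding and head matrices. A hypothetical sketch, reusing the TokenTranslator above and assuming variables src_embed_1b, src_embed_7b, and head_weight_7b for the respective frozen weights:

```python
# Hypothetical transfer: the plan indexes vocabularies, not hidden units,
# so a plan trained against the 1B model can initialize a target-vocabulary
# embedding and head for the 7B model (names and sizes are assumptions).
translator_1b = TokenTranslator(src_embed_1b, tgt_vocab_size=4096)  # arbitrary example size
# ... train translator_1b.scores against the frozen 1B backbone ...
plan = translator_1b.plan().detach()

tgt_embed_7b = plan @ src_embed_7b    # new input embeddings for the 7B model
tgt_head_7b = plan @ head_weight_7b   # matching target-vocabulary LM head
```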
Implications and Future Directions
The S2T2 method holds substantial implications for the development of more adaptable, multi-domain LLMs. Practically, this research suggests pathways to reduce inference costs and improve semantic coherence across diverse datasets, which is invaluable for applications in bioinformatics, code translation, and beyond. Theoretically, it contributes to the discourse on probabilistic and sparse methods in machine translation and domain adaptation, challenging conventional reliance on parallel corpora.
Future research could explore the integration of S2T2 with various modalities beyond proteins, such as code or image data, further extending the utility of LLMs. Additionally, the exploration of unified multidomain tokenizers that combine multiple domain vocabularies remains an intriguing, yet challenging, avenue for further investigation.
In summary, the approach provides a rigorous, scalable mechanism for expanding the applicability of LLMs, ensuring their utility in increasingly complex, diverse environments.