Tokenization with Factorized Subword Encoding (2306.07764v1)
Abstract: In recent years, language models have become increasingly large and complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords into discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
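To make the core idea concrete, here is a minimal sketch of quantizing a continuous subword encoding into a triplet of discrete codes via a VQ-VAE-style nearest-neighbour codebook lookup with a straight-through gradient estimator. All names, dimensions, and design choices here (`TripletQuantizer`, `codebook_size`, the 3-way split of the embedding, a single shared codebook) are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Hypothetical sketch of VQ-VAE-style subword factorization: a continuous
# subword encoding is split into three parts, and each part is snapped to
# its nearest codebook vector, yielding a discrete triplet of code indices.
import torch
import torch.nn as nn


class TripletQuantizer(nn.Module):
    """Quantizes a subword encoding into 3 codebook indices (a 'triplet')."""

    def __init__(self, dim: int = 96, codebook_size: int = 256):
        super().__init__()
        assert dim % 3 == 0, "embedding dim must split into 3 equal parts"
        self.part_dim = dim // 3
        # A single shared codebook; each third of the vector is quantized
        # independently (an assumption -- separate codebooks are also plausible).
        self.codebook = nn.Embedding(codebook_size, self.part_dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous subword encodings from some encoder.
        parts = z.view(z.size(0), 3, self.part_dim)  # (B, 3, dim/3)
        # Squared L2 distance from each part to every codebook entry: (B, 3, K).
        dists = (parts.unsqueeze(2)
                 - self.codebook.weight.view(1, 1, -1, self.part_dim)).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)  # (B, 3) -- the discrete triplet
        quantized = self.codebook(codes).view(z.size(0), -1)  # (B, dim)
        # Straight-through estimator: copy gradients past the discrete lookup.
        quantized = z + (quantized - z).detach()
        return codes, quantized


quantizer = TripletQuantizer()
z = torch.randn(4, 96)  # stand-in for encoder outputs of 4 subwords
codes, z_q = quantizer(z)
print(codes.shape)  # torch.Size([4, 3]): one (c1, c2, c3) triplet per subword
```

In a full model, the triplet indices would serve as the discrete input representation of each subword; the VQ-VAE training losses (codebook and commitment terms, plus a reconstruction objective) are omitted here for brevity.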