Multi-word Tokenization for Sequence Compression (2402.09949v2)

Published 15 Feb 2024 in cs.CL and cs.LG

Abstract: Large Language Models (LLMs) have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) increased performance due to greater coverage of the input data given a fixed sequence length budget; (2) faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.
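To make the idea concrete, below is a minimal Python sketch of multi-word tokenization, not the authors' implementation: the paper augments a pretrained subword tokenizer with the most frequent multi-word expressions, whereas this toy version works on whitespace-split words. The corpus, function names, and the greedy left-to-right merge rule are illustrative assumptions; the sketch only shows how merging frequent word pairs into single tokens shortens the resulting token sequence.

```python
from collections import Counter
from itertools import tee

def bigrams(words):
    """Yield adjacent word pairs from a list of words."""
    a, b = tee(words)
    next(b, None)
    return zip(a, b)

def build_mwt_vocab(corpus, top_n=100):
    """Pick the top_n most frequent word bigrams to serve as multi-word tokens."""
    counts = Counter()
    for text in corpus:
        counts.update(bigrams(text.split()))
    return {" ".join(pair) for pair, _ in counts.most_common(top_n)}

def mwt_tokenize(text, mwt_vocab):
    """Greedy left-to-right tokenization that merges known bigrams into single tokens."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in mwt_vocab:   # frequent multi-word expression -> one token
            tokens.append(pair)
            i += 2
        else:                   # fall back to the ordinary single-word token
            tokens.append(words[i])
            i += 1
    return tokens

if __name__ == "__main__":
    # Illustrative toy corpus (assumption, not from the paper).
    corpus = [
        "adverse effects were reported in the case report",
        "the adverse effects of the drug were mild",
    ]
    vocab = build_mwt_vocab(corpus, top_n=1)   # only "adverse effects" repeats here
    sentence = "the adverse effects were mild"
    print(sentence.split())                    # 5 word-level tokens
    print(mwt_tokenize(sentence, vocab))       # 4 tokens: "adverse effects" is merged
```

In the same spirit, the paper's shorter sequences mean a transformer can either see more of the document within a fixed length budget or be run with a smaller length budget for faster inference.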

Authors (4)
  1. Leonidas Gee (5 papers)
  2. Leonardo Rigutini (16 papers)
  3. Marco Ernandes (5 papers)
  4. Andrea Zugarini (22 papers)
Citations (5)