Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
GPT-4o
Gemini 2.5 Pro Pro
o3 Pro
GPT-4.1 Pro
DeepSeek R1 via Azure Pro
2000 character limit reached

Training and Evaluation of a Multilingual Tokenizer for GPT-SW3 (2304.14780v1)

Published 28 Apr 2023 in cs.CL and cs.AI

Abstract: This paper provides a detailed discussion of the multilingual tokenizer used for GPT-SW3. It was trained on the Nordic Pile using the SentencePiece library and the BPE algorithm. We outline the tokenizer's most important features and share details on its learned vocabulary. In addition, we systematically analyze the properties and evaluate the performance of the tokenizer with regard to the different languages present in the data.

Citations (7)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Authors (1)