Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement (2403.13754v1)
Abstract: The relationship between LLM tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes affect number agreement in Spanish plurals. We find that morphologically aligned tokenization performs on par with other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses showing that LLM embeddings of differently tokenized plurals have similar distributions along the embedding-space axis that maximally distinguishes singular from plural nouns. Our results suggest that morphologically aligned tokenization is a viable approach and that existing models already generalize some morphological patterns to new items, but also that morphological tokenization is not strictly required for good performance.
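The exploratory analysis described above can be illustrated with a minimal sketch: find a single direction in embedding space that separates singular from plural nouns, then compare where differently tokenized plurals fall along that direction. The snippet below uses synthetic data and a difference-of-class-means direction as a simple stand-in; the paper's actual models, embeddings, and axis-finding method are not specified here, so every detail of this sketch is an assumption.

```python
# Hypothetical sketch (synthetic data, not the paper's pipeline):
# estimate a "number axis" from singular vs. plural noun embeddings,
# then compare projections of two plural tokenization variants onto it.
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # assumed hidden size, typical of BERT-style models

# Stand-ins for contextual embeddings of singular and plural nouns;
# the +0.5 shift fakes a systematic singular/plural difference.
singular = rng.normal(0.0, 1.0, size=(200, dim))
plural = rng.normal(0.0, 1.0, size=(200, dim)) + 0.5

def number_axis(sing: np.ndarray, plur: np.ndarray) -> np.ndarray:
    """Difference-of-class-means direction, a simple proxy for the
    axis that maximally distinguishes singular from plural nouns."""
    w = plur.mean(axis=0) - sing.mean(axis=0)
    return w / np.linalg.norm(w)

w = number_axis(singular, plural)

# Hypothetical embeddings for two tokenizations of the same plurals,
# e.g. a morphologically aligned split ("flor" + "##es") vs. default BPE.
morph_plural = rng.normal(0.0, 1.0, size=(100, dim)) + 0.5
default_plural = rng.normal(0.0, 1.0, size=(100, dim)) + 0.5

for name, emb in [("morph-aligned", morph_plural), ("default BPE", default_plural)]:
    proj = emb @ w  # scalar position of each embedding along the number axis
    print(f"{name}: mean projection {proj.mean():.2f}, std {proj.std():.2f}")
```

Under this kind of analysis, the paper's finding would correspond to the two tokenization variants producing similar projection distributions along the singular/plural axis.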