
Do long, complex words drive higher marginalization gaps?

Determine whether the larger gaps in bits per character (BPC) between default-tokenization scoring and marginalized scoring in autoregressive language models are driven, at least in part, by the presence of long, complex words in the evaluation datasets.


Background

The paper investigates the impact of marginalizing over all valid tokenizations when computing LLM probabilities, comparing default-tokenization scoring to importance-sampled estimates of the marginalized probability across datasets and languages.

The authors' experiments with GPT-2 and BLOOM show that while the overall relative BPC gap is often under 0.5%, larger gaps occur in certain settings. The gap is higher when a greater proportion of sampled tokenizations are non-default. Analyzing the probability of sampling the default tokenization across blocks (roughly words), they find that simple, frequent words tend to have a high default-tokenization probability, while complex, rare words tend to have a lower one.
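The core quantity behind this question can be illustrated with a toy sketch: score a word under its single default tokenization versus marginalizing over all valid tokenizations, then compare the resulting BPC. Everything below is a hypothetical stand-in (a hand-made unigram vocabulary and a longest-segmentation-as-default rule), not the paper's code or a real tokenizer; the paper estimates the marginal with importance sampling, whereas this toy example enumerates tokenizations exactly.

```python
import math

# Assumed toy subword vocabulary with unigram log-probabilities
# (purely illustrative values, not from any trained model).
VOCAB_LOGPROB = {
    "un": math.log(0.2),
    "believable": math.log(0.1),
    "unbelievable": math.log(0.05),
    "be": math.log(0.2),
    "lievable": math.log(0.05),
}

def segmentations(text):
    """Enumerate every way to split `text` into vocabulary tokens."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        tok = text[:i]
        if tok in VOCAB_LOGPROB:
            for rest in segmentations(text[i:]):
                yield [tok] + rest

def logprob(tokens):
    # Unigram model: tokens are scored independently.
    return sum(VOCAB_LOGPROB[t] for t in tokens)

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) for marginalization.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

text = "unbelievable"
all_segs = list(segmentations(text))

# Stand-in "default" tokenization: the segmentation with the fewest tokens.
default = min(all_segs, key=len)
lp_default = logprob(default)

# Exact marginalization over all valid tokenizations.
lp_marginal = logsumexp([logprob(s) for s in all_segs])

def bpc(lp):
    return -lp / (len(text) * math.log(2))

# Marginal probability >= any single tokenization, so default BPC >= marginal BPC.
rel_gap = (bpc(lp_default) - bpc(lp_marginal)) / bpc(lp_marginal)
print(f"default BPC={bpc(lp_default):.4f}, "
      f"marginal BPC={bpc(lp_marginal):.4f}, relative gap={rel_gap:.2%}")
```

In this toy setting the long word has several plausible segmentations of comparable probability, so marginalizing noticeably lowers BPC relative to default-tokenization scoring; a short, frequent word whose default tokenization carries nearly all the mass would show almost no gap, which is the intuition behind the conjecture.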

Based on these observations, they explicitly conjecture that the presence of long complex words in datasets contributes to higher gaps, indicating an unresolved question about the underlying drivers of the marginalization effect.

References

From this observation, we conjecture that higher gaps are at least in part driven by the presence of long complex words in the datasets.

Should you marginalize over possible tokenizations? (2306.17757 - Chirkova et al., 2023) in Section 3, Results (following Table 2)