Do long, complex words drive higher marginalization gaps?
Determine whether the larger gaps in Bits Per Character (BPC) between default-tokenization scoring and marginalized scoring (which sums the sequence probability over all possible tokenizations) in autoregressive language models are driven, at least in part, by the presence of long, complex words in the evaluation datasets.
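To make the two scoring modes concrete, here is a minimal sketch under a toy unigram tokenizer (the vocabulary, probabilities, and the "default" segmentation below are illustrative assumptions, not taken from the paper): default scoring uses the probability of one segmentation, while marginalized scoring sums over every segmentation, so its BPC can only be equal or lower.

```python
import math
from functools import lru_cache

# Hypothetical toy unigram vocabulary: token -> probability.
# (Assumption for illustration only; real models are autoregressive.)
VOCAB = {
    "un": 0.1, "believ": 0.05, "able": 0.1,
    "u": 0.05, "n": 0.05, "b": 0.05, "e": 0.1,
    "l": 0.05, "i": 0.05, "v": 0.05, "a": 0.1,
}

def default_likelihood(tokens):
    """Probability of the single segmentation the default tokenizer picks."""
    return math.prod(VOCAB[t] for t in tokens)

def marginal_likelihood(text):
    """Sum of probabilities over ALL segmentations of `text` into VOCAB tokens."""
    @lru_cache(maxsize=None)
    def rec(i):
        if i == len(text):
            return 1.0
        total = 0.0
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in VOCAB:
                total += VOCAB[text[i:j]] * rec(j)
        return total
    return rec(0)

def bpc(prob, text):
    """Bits Per Character: negative log2-probability normalized by length."""
    return -math.log2(prob) / len(text)

word = "unbelievable"
default_tokens = ["un", "believ", "able"]  # pretend default tokenization
gap = bpc(default_likelihood(default_tokens), word) - bpc(marginal_likelihood(word), word)
# Marginalizing only adds probability mass, so the gap is non-negative;
# long words admitting many segmentations tend to show larger gaps.
```

A long word like "unbelievable" has many valid segmentations, each contributing probability mass to the marginal, which is one intuition for why long, complex words could inflate the default-vs-marginalized BPC gap.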
References
From this observation, we conjecture that higher gaps are at least in part driven by the presence of long complex words in the datasets.
— Should you marginalize over possible tokenizations?
(Chirkova et al., 2023, arXiv:2306.17757), Section 3, Results (following Table 2)