Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities (2406.10851v2)

Published 16 Jun 2024 in cs.CL

Abstract: Predictions of word-by-word conditional probabilities from Transformer-based LLMs are often evaluated to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the most common method of aggregating subword probabilities of such LLMs into word probabilities. This is due to the fact that tokens in the subword vocabulary of most LLMs have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in distributions over word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the end of the current word is incorrectly carried over to the next word. Additionally, this implicit prediction of word boundaries incorrectly models psycholinguistic experiments where human subjects directly observe upcoming word boundaries. We present a simple decoding technique to reaccount the probability of the trailing whitespace into that of the current word, which resolves this confound. Experiments show that this correction reveals lower estimates of garden-path effects in transitive/intransitive sentences and poorer fits to naturalistic reading times.

Authors (2)
  1. Byung-Doh Oh (9 papers)
  2. William Schuler (15 papers)
Citations (7)

Summary

  • The paper demonstrates that leading whitespaces in subword tokenization disrupt correct word probability allocation, causing probabilities to sum above one.
  • It introduces a Whitespace-Trailing (WT) decoding method that reallocates each word's trailing-whitespace probability into that word, so that word probabilities include an explicit stop decision.
  • Empirical case studies, including garden-path sentences, show that the correction lowers surprisal-based estimates of garden-path effects, consistent with the claim that standard decoding misallocates surprisal across word boundaries.

Leading Whitespace Confounds in LLMs: Implications for Word Probabilities

The paper, "Leading Whitespaces of LLMs' Subword Vocabulary Poses a Confound for Calculating Word Probabilities," addresses a fundamental issue concerning the calculation of word-by-word conditional probabilities in Transformer-based LLMs (LMs). The authors, Byung-Doh Oh and William Schuler, elucidate how the incorporation of leading whitespaces in the tokenization scheme can lead to inconsistencies and unexpected confounds with far-reaching implications, particularly in the context of computational linguistics and cognitive modeling.

The Problem with Leading Whitespace

Central to their argument is the observation that leading whitespaces, a characteristic of subword tokenization schemes such as Byte-Pair Encoding (BPE), disrupt the correct allocation of word probabilities. The authors prove that this disruption can cause the calculated probabilities to sum to more than one, thus contradicting the fundamental probability axiom $\mathsf{P}(\Omega) = 1$. This violation implies that LMs, by default, misallocate word-by-word surprisals.
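
To make the confound concrete, the following Python sketch computes a word probability by chaining subword probabilities under the standard leading-whitespace segmentation. It is an illustrative sketch, not the authors' code: it assumes the Hugging Face `transformers` library and the public `gpt2` checkpoint, and the helper name `word_logprob` is hypothetical.

```python
# Sketch of standard subword-to-word probability aggregation with a
# leading-whitespace BPE vocabulary (assumes `transformers` and `gpt2`).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_logprob(context: str, word: str) -> float:
    """Log P(word | context) as the chained log-probabilities of its subwords.

    Under BPE, the word is segmented with its leading whitespace (e.g. " cat"),
    so the probability of *ending* the word is never scored here: the boundary
    decision is hidden inside the next word's leading-whitespace token.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word).input_ids  # leading-whitespace segmentation
    logprob, ids = 0.0, ctx_ids
    for wid in word_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        logprob += torch.log_softmax(logits, dim=-1)[wid].item()
        ids = torch.cat([ids, torch.tensor([[wid]])], dim=-1)
    return logprob

# Because no stop probability is included, a short word (e.g. "diff") absorbs
# all of the probability mass of its subword prefix, while longer words that
# extend it ("different", "difference", ...) are scored again from the same
# prefix, so word probabilities at a position can sum to more than one.
```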

Implications for Cognitive Modeling

The paper highlights the implications of this confound for modeling the incremental processing difficulty of human readers. In self-paced reading and eye-tracking experiments, human subjects directly observe where upcoming word boundaries fall, whereas LMs under standard decoding must implicitly predict those boundaries as part of each word's probability. The authors argue that this incongruity distorts surprisal-based estimates of processing difficulty, including garden-path effects.

Proposed Solution: Whitespace-Trailing Decoding

To mitigate this issue, the authors introduce a decoding technique termed Whitespace-Trailing (WT) decoding. This method reallocates the probability of the whitespace that follows each word (encoded in the leading whitespace of the next token) into the probability of the current word, so that every word probability includes an explicit stop decision. This resolves the confound and brings the modeled quantity in line with what human readers actually observe in self-paced reading and eye-tracking paradigms.
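
The sketch below illustrates this reallocation under one plausible reading of the method; it is not the authors' released implementation, and the helper names are hypothetical. It treats GPT-2 vocabulary entries that start with "Ġ" (a leading whitespace) as word-boundary tokens, multiplies in the boundary mass after the current word, and divides out the boundary mass already credited to the previous word; a fuller treatment would also count punctuation and end-of-text tokens as boundaries.

```python
# Hedged sketch of whitespace-trailing (WT) style reallocation
# (assumes `transformers`, `gpt2`, and GPT-2's "Ġ"-prefixed whitespace tokens).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Token ids whose vocabulary string begins with a whitespace marker.
boundary_ids = torch.tensor(
    [idx for tok, idx in tokenizer.get_vocab().items() if tok.startswith("Ġ")]
)

def boundary_mass(ids: torch.Tensor) -> float:
    """Total probability of boundary-initial tokens at the next position."""
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    return probs[boundary_ids].sum().item()

def wt_word_logprob(context: str, word: str) -> float:
    """Approximate log P_WT(word | context) with trailing-boundary reallocation."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    # Divide out the boundary mass that was already credited to the previous word.
    logprob = -math.log(boundary_mass(ctx_ids))
    word_ids = tokenizer(" " + word).input_ids
    ids = ctx_ids
    for wid in word_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        logprob += torch.log_softmax(logits, dim=-1)[wid].item()
        ids = torch.cat([ids, torch.tensor([[wid]])], dim=-1)
    # Fold the word's trailing boundary (a whitespace-initial next token)
    # into the word's own probability.
    logprob += math.log(boundary_mass(ids))
    return logprob
```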

Case Study: Garden-Path Sentences

The effectiveness of WT decoding is demonstrated through a case study on garden-path sentences. Garden-path effects involve increased processing difficulty when a reader must reinterpret a sentence upon encountering an unexpected syntactic structure. Applying WT decoding produced significantly different surprisal-based estimates of processing difficulty: the corrected method lowered the estimated magnitude of garden-path effects relative to standard leading-whitespace decoding, indicating that part of the apparent effect under standard decoding stems from misallocated word-boundary probability rather than from the disambiguating word itself.
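
As an illustrative usage of the two sketches above, one can compare surprisal at the disambiguating verb of a generic garden-path sentence under the two decoding schemes; the sentence is a made-up example, not one of the paper's stimuli.

```python
# Usage example: surprisal (in bits) at the disambiguating verb, reusing
# word_logprob and wt_word_logprob from the sketches above.
import math

context = "While the man hunted the deer"
target = "ran"  # disambiguating verb of the garden-path continuation

standard_bits = -word_logprob(context, target) / math.log(2)
wt_bits = -wt_word_logprob(context, target) / math.log(2)
print(f"standard surprisal: {standard_bits:.2f} bits")
print(f"WT surprisal:       {wt_bits:.2f} bits")
```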

Practical and Theoretical Implications

Practically, the findings of this paper prompt a reconsideration of how word probabilities are computed for applications like targeted syntactic evaluation and reading time prediction. Theoretically, the work challenges the current treatment of probability calculations in LMs and suggests rethinking how token boundaries are handled during training and inference. Addressing the confound is likely to have the largest impact at low-probability word boundaries, such as phrasal or clausal transitions where punctuation is a likely continuation.

Future Directions

Future research should explore the applicability of WT decoding across different languages, particularly those with non-whitespace orthographies. Moreover, given that smaller LMs trained on limited data might exhibit more pronounced issues with word boundary prediction, it would be prudent to test the method across varied model sizes and architectures. Doing so would help establish the robustness of WT decoding and its generalizability across linguistic contexts.

Limitations

The paper is limited to English LMs and does not extend its findings to other languages or to NLP applications beyond cognitive modeling. Its evaluations also depend on the availability of substantial human reading time data. Nevertheless, the identified issues with word probability calculation warrant broader evaluation across settings and tasks.

Conclusion

This paper significantly contributes to the field by identifying and addressing a crucial confound in leading-whitespace tokenization in LMs. The proposed WT decoding method offers a robust solution, potentially improving both theoretical insights and practical applications related to LM interpretability and cognitive modeling. This highlights the necessity of reexamining subword tokenization schemes to ensure accurate and reliable language processing outcomes.