Precise measurement of human perplexity in next-token prediction

Develop a precise methodology for measuring human perplexity (the exponentiated cross-entropy of human next-token predictive distributions) on web-text next-token prediction, e.g., OpenWebText tokenized with a 50,000-token vocabulary, so that human perplexity can be estimated directly and accurately rather than inferred from pairwise probability ratios and assumptions about calibration.
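For concreteness, the following is a minimal sketch of the target quantity, assuming one could elicit the probability a human assigns to each actual next token; the elicitation itself is the open part of the problem, and the function name and example probabilities are hypothetical:

```python
import math

def cross_entropy_and_perplexity(true_token_probs):
    """Given the probability a predictor assigned to each actual next
    token in a passage, return (mean cross-entropy in nats, perplexity)."""
    # Cross-entropy is the average negative log-probability of the true
    # tokens; perplexity is its exponential.
    nll = [-math.log(p) for p in true_token_probs]
    cross_entropy = sum(nll) / len(nll)
    return cross_entropy, math.exp(cross_entropy)

# Hypothetical probabilities a predictor might assign to four true tokens.
ce, ppl = cross_entropy_and_perplexity([0.30, 0.05, 0.60, 0.12])
print(f"cross-entropy = {ce:.3f} nats, perplexity = {ppl:.2f}")
```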

Background

The paper compares human and LLM performance on next-token prediction using two experiments: a top-1 accuracy task and an indirect estimate of human perplexity derived from pairwise probability-ratio judgments combined with importance sampling on OpenWebText. While the authors find that models outperform humans, they emphasize methodological limitations in estimating human perplexity, including calibration difficulties, discretized response options, and heavy-tailed estimators that can underestimate loss.
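The ratio-judgment estimator is not fully specified here, but one plausible reconstruction is sketched below, under the assumption that humans report the ratio p_human(v) / p_human(true token) for candidate tokens v drawn from a model proposal q. The toy vocabulary, simulated "human" distribution, and function names are all hypothetical. Note that taking the log of an unbiased estimate of 1/p underestimates the loss on average (Jensen's inequality), which is consistent with the heavy-tailed underestimation the authors flag:

```python
import math
import random

def estimate_neg_logprob(ratio_to_true, proposal_probs, n_samples=10_000,
                         rng=random):
    """Importance-sampling sketch for a predictor's loss on the true token.

    ratio_to_true: maps each token v to the judged ratio
        p_human(v) / p_human(true token); the true token maps to 1.0.
    proposal_probs: maps each token to its probability under a model
        proposal distribution q.

    Identity: 1/p_h(true) = sum_v p_h(v)/p_h(true)
                          = E_{v~q}[ ratio_to_true[v] / q(v) ],
    so averaging ratio/q over samples from q estimates 1/p_h(true); its
    log is a Jensen-biased (loss-underestimating) estimate of -log p_h(true).
    """
    tokens = list(proposal_probs)
    weights = [proposal_probs[t] for t in tokens]
    total = 0.0
    for _ in range(n_samples):
        tok = rng.choices(tokens, weights=weights)[0]
        total += ratio_to_true[tok] / proposal_probs[tok]
    return math.log(total / n_samples)

# Toy check on a hypothetical 4-token vocabulary: a simulated "human"
# distribution supplies exact ratio judgments, so the estimate should
# approach the true loss -log p_human(true token).
human = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
proposal = {"a": 0.4, "b": 0.4, "c": 0.1, "d": 0.1}
true_tok = "b"
ratios = {t: p / human[true_tok] for t, p in human.items()}
est = estimate_neg_logprob(ratios, proposal)
print(f"estimated loss {est:.3f} vs exact {-math.log(human[true_tok]):.3f}")
```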

In the results discussion for the perplexity experiment, the authors explicitly acknowledge that they lack a method for precisely measuring human perplexity and note that their approach may systematically misestimate it. This leaves open the development of a robust, precise, and validated procedure for measuring human perplexity on typical internet text.

References

"Thus, while we don’t have a good way to precisely measure human perplexity, these results give reasonable evidence that it is high."

Shlegeris et al. (2022). "Language models are better than humans at next-token prediction." arXiv:2212.11281, Section 4.2 (Results), under "Measuring human perplexity."