Language Model Evaluation Beyond Perplexity (2106.00085v3)
Abstract: We propose an alternative approach to quantifying how well language models (LMs) learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from LMs exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of LMs to these trends. We find that neural LMs appear to learn only a subset of the tendencies considered, but they align much more closely with empirical trends than with proposed theoretical distributions (when such distributions exist). Further, the fit to different distributions is highly dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type--token relationship of natural language than text produced using standard ancestral sampling, and text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.
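To give a flavor of this kind of evaluation, the sketch below compares one statistical tendency mentioned in the abstract, the distribution of token lengths, between human-written and model-generated text, using a two-sample Kolmogorov--Smirnov test. The choice of statistic and test here is an illustrative assumption, not the paper's exact framework or its specific significance tests.

```python
# Illustrative sketch (an assumption, not the paper's exact procedure):
# compare the token-length distribution of model-generated text against
# human reference text with a two-sample Kolmogorov-Smirnov test.

from scipy import stats


def token_lengths(text: str) -> list[int]:
    """Return the length (in characters) of each whitespace-delimited token."""
    return [len(tok) for tok in text.split()]


def compare_length_distributions(human_text: str, model_text: str, alpha: float = 0.05) -> dict:
    """Test whether the two texts' token-length distributions differ.

    A small p-value suggests the generated text does not reproduce the
    human length distribution; the length statistic and the KS test are
    stand-ins for the paper's broader set of distributions and tests.
    """
    statistic, p_value = stats.ks_2samp(token_lengths(human_text), token_lengths(model_text))
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        "distributions_differ": p_value < alpha,
    }


if __name__ == "__main__":
    # Toy strings stand in for a human corpus and a sampled model corpus.
    human = "the quick brown fox jumps over the lazy dog " * 50
    model = "a generated sample of text from a language model " * 50
    print(compare_length_distributions(human, model))
```

In the same spirit, one could swap in other statistics from the abstract (type--token counts, stopword or symbol frequencies) and compare samples drawn with nucleus versus ancestral sampling.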