A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors (2406.10203v4)
Abstract: The relationship between the quality of a string, as judged by a human reader, and its probability $p(\boldsymbol{y})$ under an LLM undergirds the development of better LLMs. For example, many popular algorithms for sampling from an LLM have been conceived with the goal of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that humans deem of high quality. In this article, we examine the probability--quality relationship in LLMs explicitly aligned to human preferences, e.g., through reinforcement learning from human feedback (RLHF). We show that, when sampling corpora from an aligned LLM, there exists a trade-off between the strings' average reward and their average log-likelihood under the prior LLM, i.e., the same model before alignment with human preferences. We provide a formal treatment of this phenomenon and demonstrate how the choice of sampling adaptor determines how much likelihood we exchange for reward.
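The following toy sketch (not the paper's code) illustrates the kind of trade-off the abstract describes, under the common assumption that a KL-regularized aligned model has the closed form $\pi(\boldsymbol{y}) \propto p(\boldsymbol{y})\exp(r(\boldsymbol{y})/\beta)$. Each "string" is a single symbol, the prior, reward scores, $\beta$, and the temperature sampling adaptor are all invented for illustration; the paper's precise theorem statement may differ.

```python
# Toy illustration: sampling adaptors trade prior log-likelihood for reward.
# Assumption: aligned model pi(y) ∝ p(y) * exp(r(y) / beta) (standard
# KL-regularized RLHF form); prior, rewards, and beta are made up.
import numpy as np

rng = np.random.default_rng(0)

V = 1000                                    # number of candidate "strings"
log_p = np.log(rng.dirichlet(np.ones(V)))   # toy prior LLM log-probabilities
reward = rng.normal(size=V)                 # toy reward-model scores
beta = 1.0                                  # KL-regularization strength

# Aligned model: log pi(y) = log p(y) + r(y)/beta - log Z
unnorm = log_p + reward / beta
log_Z = np.log(np.exp(unnorm - unnorm.max()).sum()) + unnorm.max()
log_pi = unnorm - log_Z

def temperature_adaptor(log_q, T):
    """A simple sampling adaptor: renormalize exp(log_q / T)."""
    logits = log_q / T
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

for T in (2.0, 1.0, 0.5):
    q = temperature_adaptor(log_pi, T)
    ys = rng.choice(V, size=100_000, p=q)   # sample a corpus with this adaptor
    avg_r = reward[ys].mean()
    avg_log_p = log_p[ys].mean()
    avg_log_pi = log_pi[ys].mean()
    # Because log pi = log p + r/beta - log Z, every sampled corpus satisfies
    # avg_r = beta * (avg_log_pi - avg_log_p + log_Z): for comparable aligned
    # log-likelihood, higher average reward costs prior log-likelihood.
    check = beta * (avg_log_pi - avg_log_p + log_Z)
    print(f"T={T}: avg reward={avg_r:+.3f}, avg prior log-lik={avg_log_p:.3f}, "
          f"identity check: {avg_r:.3f} ~= {check:.3f}")
```

The identity printed in the loop is exact per string, so it holds exactly for every sampled corpus; the choice of adaptor (here, the temperature $T$) only selects which point along the resulting reward-versus-likelihood frontier the corpus occupies.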