Improving Self Consistency in LLMs through Probabilistic Tokenization
The paper "Improving Self Consistency in LLMs through Probabilistic Tokenization" presents a novel approach to enhance the self-consistency of LLMs by employing probabilistic tokenization. The research investigates the impact of generating multiple tokenizations for a given input string during the inference phase, thereby facilitating the development of divergent reasoning paths in response to reasoning tasks. The findings indicate that this method can lead to substantial improvements in the self-consistency of transformer-based and non-transformer-based LLMs across several reasoning benchmarks.
Summary of Contributions
The core contribution of the paper is the introduction of a probabilistic tokenization framework that leverages the ability of contemporary LLM tokenizers to produce multiple valid tokenizations of an input string. This process capitalizes on the inherent diversity within tokenization sequences to generate varied reasoning paths.
- Tokenization Process: The authors use a unigram language model to estimate the likelihood of each candidate tokenization, obtained either through a simplified counting procedure over a subset of a large-scale corpus or through the Expectation-Maximization (EM) algorithm for more precise estimates. Tokenizations are then sampled in proportion to these estimated likelihoods, introducing controlled diversity into the input representation (see the first sketch following this list).
- Protocol for Diverse Reasoning Paths: By generating multiple tokenizations of the same prompt, the approach sidesteps decoding-time diversity mechanisms such as nucleus sampling or temperature-based sampling. The authors demonstrate that reasoning paths produced from different tokenizations differ not only in surface wording but also in logical structure, which improves the model's ability to consistently arrive at accurate conclusions (a plausible voting loop is sketched after this list).
- Evaluation and Results: The approach was evaluated on a suite of reasoning tasks—MATH, AQuA, GSM8K, and PIQA—across models of varying scales, including OLMo-7B and LLaMA3-8B. The results show that probabilistic tokenization significantly improves both minority- and majority-class accuracy, i.e., how often the sampled reasoning paths converge on the correct answer. Gains were especially pronounced in models whose reasoning capabilities are still emergent or nascent.
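To make the sampling step concrete, the minimal sketch below (not taken from the paper) builds a toy unigram vocabulary, enumerates every segmentation of a short string, and samples tokenizations in proportion to the product of their unigram probabilities. The vocabulary, its probabilities, and the exhaustive enumeration are illustrative assumptions; a real tokenizer would estimate the unigram model from corpus counts or EM, as described above, and sample from the segmentation lattice without enumerating it.

```python
import math
import random

# Toy unigram vocabulary with made-up probabilities; the paper estimates these
# from corpus counts or via EM, so treat the numbers here as placeholders.
UNIGRAM_P = {
    "un": 0.04, "like": 0.05, "ly": 0.06, "unlike": 0.02,
    "likely": 0.03, "unlikely": 0.01,
    "u": 0.01, "n": 0.01, "l": 0.005, "i": 0.01, "k": 0.005, "e": 0.02, "y": 0.01,
}

def segmentations(s):
    """Enumerate every way to split `s` into in-vocabulary tokens."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        piece = s[:end]
        if piece in UNIGRAM_P:
            for rest in segmentations(s[end:]):
                yield [piece] + rest

def sample_tokenizations(s, k, rng=random):
    """Sample k tokenizations, each with probability proportional to the
    product of its tokens' unigram probabilities."""
    candidates = list(segmentations(s))
    weights = [math.prod(UNIGRAM_P[t] for t in toks) for toks in candidates]
    return [rng.choices(candidates, weights=weights, k=1)[0] for _ in range(k)]

if __name__ == "__main__":
    for toks in sample_tokenizations("unlikely", k=5):
        print(toks)  # e.g. ['un', 'likely'], ['unlike', 'ly'], ['unlikely'], ...
```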
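The second sketch shows one plausible way to wire sampled tokenizations into a self-consistency vote. Here `sample_tokenizations`, `generate`, and `extract_answer` are hypothetical interfaces standing in for whatever tokenizer, LLM, and answer parser are in use, and decoding is assumed to be greedy so that all diversity comes from the input tokenization rather than from decoder sampling; this is a reading of the protocol, not the authors' implementation.

```python
from collections import Counter

def self_consistent_answer(prompt, sample_tokenizations, generate, extract_answer, k=10):
    """Majority vote over reasoning paths that differ only in how the prompt
    was tokenized; decoding itself is assumed deterministic (greedy)."""
    votes = []
    for token_ids in sample_tokenizations(prompt, k):  # k sampled tokenizations of one prompt
        completion = generate(token_ids)               # hypothetical greedy decoding call
        votes.append(extract_answer(completion))       # hypothetical parser, e.g. final number
    tally = Counter(votes)
    answer, _ = tally.most_common(1)[0]
    return answer, tally
```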
Implications and Future Work
The implications of these findings are substantial for the design and deployment of LLMs in reasoning-centric applications. By varying the token sequences a model processes, probabilistic tokenization encourages the model to explore a broader range of potential reasoning paths. The method improves internal consistency without modifying model architectures or training protocols, thus preserving deployment efficiency.
Looking forward, future work could explore the applicability of this approach in domain-specific contexts, such as medical literature or code generation, where the underlying corpus may deviate significantly from the general-purpose datasets typically used for LLM training. The effect of probabilistic tokenization on model architectures beyond those considered here is another promising direction.
The introduction of probabilistic tokenization is a meaningful step toward more interpretable and reliable AI systems, potentially offering more nuanced insight into models' reasoning and decision-making processes. The method opens avenues for further investigation of input-diversification techniques that improve model robustness and output consistency in a rapidly evolving AI landscape.