Improving Self Consistency in LLMs through Probabilistic Tokenization
The paper "Improving Self Consistency in LLMs through Probabilistic Tokenization" presents a novel approach to enhance the self-consistency of LLMs by employing probabilistic tokenization. The research investigates the impact of generating multiple tokenizations for a given input string during the inference phase, thereby facilitating the development of divergent reasoning paths in response to reasoning tasks. The findings indicate that this method can lead to substantial improvements in the self-consistency of transformer-based and non-transformer-based LLMs across several reasoning benchmarks.
Summary of Contributions
The core contribution of the paper is the introduction of a probabilistic tokenization framework that leverages the ability of contemporary LLM tokenizers to produce multiple valid tokenizations of an input string. This process capitalizes on the inherent diversity within tokenization sequences to generate varied reasoning paths.
- Tokenization Process: The authors use a unigram language model to estimate the likelihood of each candidate tokenization, obtained either through a simplified counting procedure over a subset of a large-scale corpus or through the Expectation-Maximization (EM) algorithm for more precise estimates. Tokenizations are then sampled in proportion to these estimated likelihoods, introducing controlled diversity into the input representation (see the first sketch following this list).
- Protocol for Diverse Reasoning Paths: By generating multiple tokenizations of the same prompt, the approach sidesteps decoding-time diversity mechanisms such as nucleus sampling or temperature-based sampling. The authors demonstrate that reasoning paths produced from different tokenizations differ not only in surface wording but also in logical structure, which improves the model's ability to consistently arrive at accurate conclusions (a plausible voting loop is sketched after this list).
- Evaluation and Results: The approach was evaluated on a suite of reasoning tasks—MATH, AQuA, GSM8K, and PIQA—across models of varying scales, including OLMo-7B and LLaMA3-8B. The results show that probabilistic tokenization significantly improves both minority- and majority-class accuracy, i.e., how often the sampled reasoning paths converge on the correct answer. Gains were especially pronounced in models whose reasoning capabilities are still emergent or nascent.
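To make the sampling step concrete, the minimal sketch below (not taken from the paper) builds a toy unigram vocabulary, enumerates every segmentation of a short string, and samples tokenizations in proportion to the product of their unigram probabilities. The vocabulary, its probabilities, and the exhaustive enumeration are illustrative assumptions; a real tokenizer would estimate the unigram model from corpus counts or EM, as described above, and sample from the segmentation lattice without enumerating it.

```python
import math
import random

# Toy unigram vocabulary with made-up probabilities; the paper estimates these
# from corpus counts or via EM, so treat the numbers here as placeholders.
UNIGRAM_P = {
    "un": 0.04, "like": 0.05, "ly": 0.06, "unlike": 0.02,
    "likely": 0.03, "unlikely": 0.01,
    "u": 0.01, "n": 0.01, "l": 0.005, "i": 0.01, "k": 0.005, "e": 0.02, "y": 0.01,
}

def segmentations(s):
    """Enumerate every way to split `s` into in-vocabulary tokens."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        piece = s[:end]
        if piece in UNIGRAM_P:
            for rest in segmentations(s[end:]):
                yield [piece] + rest

def sample_tokenizations(s, k, rng=random):
    """Sample k tokenizations, each with probability proportional to the
    product of its tokens' unigram probabilities."""
    candidates = list(segmentations(s))
    weights = [math.prod(UNIGRAM_P[t] for t in toks) for toks in candidates]
    return [rng.choices(candidates, weights=weights, k=1)[0] for _ in range(k)]

if __name__ == "__main__":
    for toks in sample_tokenizations("unlikely", k=5):
        print(toks)  # e.g. ['un', 'likely'], ['unlike', 'ly'], ['unlikely'], ...
```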
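The second sketch shows one plausible way to wire sampled tokenizations into a self-consistency vote. Here `sample_tokenizations`, `generate`, and `extract_answer` are hypothetical interfaces standing in for whatever tokenizer, LLM, and answer parser are in use, and decoding is assumed to be greedy so that all diversity comes from the input tokenization rather than from decoder sampling; this is a reading of the protocol, not the authors' implementation.

```python
from collections import Counter

def self_consistent_answer(prompt, sample_tokenizations, generate, extract_answer, k=10):
    """Majority vote over reasoning paths that differ only in how the prompt
    was tokenized; decoding itself is assumed deterministic (greedy)."""
    votes = []
    for token_ids in sample_tokenizations(prompt, k):  # k sampled tokenizations of one prompt
        completion = generate(token_ids)               # hypothetical greedy decoding call
        votes.append(extract_answer(completion))       # hypothetical parser, e.g. final number
    tally = Counter(votes)
    answer, _ = tally.most_common(1)[0]
    return answer, tally
```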
Implications and Future Work
The implications of these findings are substantial for the design and deployment of LLMs in reasoning-centric applications. By varying the token sequences a model processes, probabilistic tokenization encourages the model to explore a broader range of potential reasoning paths. The method improves internal consistency without modifying model architectures or training protocols, thus preserving deployment efficiency.
Looking forward, future work could explore the applicability of this approach in domain-specific contexts, such as medical literature or code generation, where the underlying corpus may deviate significantly from the general-purpose datasets typically used for LLM training. The effect of probabilistic tokenization on model architectures beyond those considered here is another promising direction.
The introduction of probabilistic tokenization is a meaningful step toward more interpretable and reliable AI systems, potentially offering more nuanced insight into models' reasoning and decision-making processes. The method opens avenues for further investigation of input-diversification techniques that improve model robustness and output consistency in a rapidly evolving AI landscape.