Counterfactual Token Generation in LLMs
The paper "Counterfactual Token Generation in LLMs" presents a novel approach for augmenting LLMs with the ability to perform counterfactual token generation. The authors focus on enabling LLMs to generate text while considering hypothetical alternatives to previously generated tokens, addressing a fundamental limitation in current models which lack internal memory or state. This paper outlines the development of a causal model of token generation based on the Gumbel-Max structural causal model (SCM), implementation on the Llama 3 8B-instruct model, and a series of experiments that highlight the efficacy and implications of this approach.
Causal Model of Token Generation
The authors propose a causal augmentation of the autoregressive token generation process intrinsic to LLMs, with the Gumbel-Max SCM at its core. At each step, the sampled token is expressed as the argmax of the model's log-probabilities over the vocabulary plus independent Gumbel noise; by recording these noise values during generation, the sampler can later answer counterfactual questions deterministically, replaying the same noise under an intervened prefix. The proposed SCM-based methodology requires minimal computational overhead and needs neither fine-tuning nor prompt engineering, making it practically advantageous.
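To make this concrete, here is a minimal sketch of the idea in NumPy, not the authors' implementation: `logits_fn` is an assumed stand-in for a forward pass of the language model over the current token prefix, and sampling is done over the full vocabulary.

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    # Gumbel-Max trick: argmax(logits + Gumbel(0, 1) noise) is an exact
    # sample from the softmax distribution over the logits.
    noise = rng.gumbel(size=logits.shape)
    return int(np.argmax(logits + noise)), noise

def generate(logits_fn, prompt_ids, max_new_tokens, rng):
    # Factual generation: record the Gumbel noise used at every step so
    # the sampler can later be replayed counterfactually.
    ids, trace = list(prompt_ids), []
    for _ in range(max_new_tokens):
        token, noise = gumbel_max_sample(logits_fn(ids), rng)
        ids.append(token)
        trace.append(noise)
    return ids, trace

def counterfactual_generate(logits_fn, prefix_ids, trace):
    # Counterfactual generation: re-run the sampler on an intervened
    # prefix, reusing the recorded noise instead of drawing fresh samples.
    ids = list(prefix_ids)
    for noise in trace:
        ids.append(int(np.argmax(logits_fn(ids) + noise)))
    return ids
```

Because the noise is held fixed, any step whose intervened token distribution still favors the factual token reproduces it exactly; this counterfactual stability is what keeps factual and counterfactual outputs close in the experiments below.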
Experimental Analysis
The experiments conducted cover several key areas: qualitative assessment through narrative generation, quantitative analysis of text similarity, and bias detection in generated census data.
Narrative Generation
An illustrative example demonstrated the effectiveness of counterfactual token generation. After changing the protagonist’s name from “Lyra” to “Maeve” in a generated story, the counterfactual stability of the Gumbel-Max SCM ensured that the counterfactual narrative remained largely similar to the factual one. Interventions on other individual words (such as the ship’s name) and attributes (such as the word “trusty”, or the phrase “endless sea” changed to “blue”) likewise produced noticeable but partial divergence between the generated texts, highlighting the model's sensitivity to seemingly minor changes.
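Under the sketch above, such an intervention amounts to a single-token swap followed by a noise replay. The helper below reuses `generate` and `counterfactual_generate` from the earlier sketch and assumes, for simplicity, that the intervened word occupies one token; all names and positions are illustrative.

```python
def swap_and_replay(logits_fn, prompt_ids, factual_ids, trace, pos, new_token):
    # Intervene on the token at position pos of the factual output and
    # counterfactually regenerate everything after it with the same noise.
    # trace[i] holds the noise used for factual_ids[len(prompt_ids) + i].
    prefix = factual_ids[:pos] + [new_token]
    remaining_noise = trace[pos + 1 - len(prompt_ids):]
    return counterfactual_generate(logits_fn, prefix, remaining_noise)

# e.g., replace the token encoding "Lyra" with the one encoding "Maeve"
# and replay the rest of the story (position and token id hypothetical):
# cf_ids = swap_and_replay(logits_fn, prompt_ids, factual_ids, trace,
#                          pos=42, new_token=MAEVE_TOKEN_ID)
```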
Text Similarity
Evaluating the similarity between factual and counterfactual text via the Levenshtein edit distance showed that counterfactual token generation yields sequences far closer to the original than interventional generation does. The analysis included various sampling strategies, such as the top-k and top-p modifications, which, though lacking guaranteed counterfactual stability, still displayed practical consistency in prioritizing tokens similar to the original generation.
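As a point of reference, a standard dynamic-programming edit distance over token sequences looks as follows; the example sequences are invented for illustration and are not taken from the paper.

```python
def levenshtein(a, b):
    # Two-row dynamic program for the edit distance between sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

factual        = ["the", "ship", "sailed", "the", "endless", "sea"]
counterfactual = ["the", "ship", "sailed", "the", "blue", "sea"]
interventional = ["a", "vessel", "crossed", "a", "vast", "ocean"]
print(levenshtein(factual, counterfactual))  # 1: counterfactual stays close
print(levenshtein(factual, interventional))  # 6: a fresh sample diverges
```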
Bias Analysis
Through the lens of demographic attributes, the authors utilized counterfactual token generation to probe biases in the Llama 3 8B-Instruct model. For instance, counterfactual interventions on sex and race within generated census data revealed biases and inconsistencies: a substantial portion of counterfactual individuals experienced shifts in attributes such as income, education, and occupation, suggesting implicit biases in the model's underlying worldview. A sketch of such an audit appears after the list below.
- Income Analysis: Income generally decreased when male individuals were counterfactually altered to female, whereas the effect of the reverse intervention on female individuals was less predictable, indicating potential gender bias.
- Education Level: Interventions on race showed marked variations in education levels, suggesting the model mirrors real-world inequities.
- Occupational Shifts: Counterfactual individuals often changed occupation, notably from STEM fields to the humanities when race was switched from Asian American to Black or African American, underscoring the model's ingrained stereotypes and biases.
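A rough illustration of how such an audit might be scripted, not the authors' code: the loop below flips a single sex-encoding token and counts downstream income shifts. `generate` and `swap_and_replay` come from the earlier sketches, while `logits_fn`, `prompt_ids`, `sex_token_position`, `parse_income`, and `FEMALE_TOKEN_ID` are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical audit loop over many sampled census records.
changed, n_samples = 0, 1000
for seed in range(n_samples):
    rng = np.random.default_rng(seed)
    factual_ids, trace = generate(logits_fn, prompt_ids, 128, rng)
    counterfactual_ids = swap_and_replay(
        logits_fn, prompt_ids, factual_ids, trace,
        pos=sex_token_position(factual_ids),  # assumed helper
        new_token=FEMALE_TOKEN_ID)            # assumed token id
    # parse_income: assumed helper extracting the income field from ids.
    if parse_income(counterfactual_ids) != parse_income(factual_ids):
        changed += 1
print(f"income shifted in {100 * changed / n_samples:.1f}% of counterfactuals")
```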
Implications and Future Directions
The theoretical contribution of this work—equipping LLMs with counterfactual reasoning abilities—has broad implications for both the practical deployment of AI systems and theoretical advancements in machine learning. Practically, this methodology can enhance bias detection and fairness audits in AI, providing tools to modulate and understand AI behavior under alternate hypothetical conditions.
Future research could explore the sensitivity of counterfactual token generation to different SCMs and investigate alternatives beyond the Gumbel-Max SCM, particularly those with varied counterfactual stability properties. Implementing this methodology across a diverse array of LLM architectures would help delineate differences in the internal models constructed by various LLMs, potentially correlating model scale with counterfactual sensitivity. Additionally, aligning counterfactual generation strategies with human feedback mechanisms presents an exciting avenue for refining LLMs to better understand and utilize causal relationships.
Conclusion
The paper by Chatzi et al. makes significant strides in advancing LLM capabilities beyond standard, purely factual token generation, proposing a practical and theoretically sound framework for counterfactual token generation. Their work opens new pathways for understanding and mitigating biases in AI while pushing the boundaries of machine reasoning and causality.