Bayesian Statistical Modeling with Predictors from LLMs
The paper "Bayesian Statistical Modeling with Predictors from LLMs" by Michael Franke, Polina Tsvilodub, and Fausto Carcassi explores the viability of integrating LLMs within Bayesian statistical frameworks. This investigation is motivated by the burgeoning use of LLMs in practical applications where these models frequently serve as proxies for human cognition, especially in tasks demanding nuanced judgment or decision-making.
Methodological Foundations
The authors ground their approach in the context of forced-choice decision tasks associated with pragmatic language use. Specifically, they employ a Bayesian statistical framework to model human data derived from a reference game, a well-established experimental paradigm in psycholinguistics.
The paper stresses the methodological challenges inherent in leveraging LLM-derived predictors for human-like decision tasks. Given that traditional LLM evaluations often rely on benchmark tests with predefined "gold standard" answers, the authors argue for a more holistic assessment method that considers the full distribution of human responses.
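To make this contrast concrete, the following sketch compares a benchmark-style "gold standard" check with a distribution-level evaluation of the kind the authors advocate. The data, probabilities, and variable names are purely illustrative and not taken from the paper.

```python
# Minimal sketch (not the paper's code): scoring a model's predicted choice
# distribution against the *full* distribution of human responses, rather than
# against a single "gold standard" answer. All numbers are illustrative.
import numpy as np
from scipy import stats

# Hypothetical data for one condition of a forced-choice task with 3 options.
human_counts = np.array([78, 15, 7])        # how often humans chose each option
model_probs = np.array([0.70, 0.20, 0.10])  # a model's predicted choice probabilities
gold_index = 0                              # the option a benchmark would call "correct"

# Benchmark-style evaluation: does the model's top choice match the gold answer?
wta_correct = int(np.argmax(model_probs) == gold_index)

# Distribution-level evaluation: multinomial log-likelihood of the observed
# human response counts under the model's predicted probabilities.
log_lik = stats.multinomial.logpmf(human_counts, n=human_counts.sum(), p=model_probs)

print(f"winner-takes-all correct: {wta_correct}")
print(f"multinomial log-likelihood of human data: {log_lik:.2f}")
```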
Empirical Investigation
The empirical investigation is set up as follows:
- Experimental Data: The authors design an experiment in which human participants engage in reference games involving contextualized decision-making. Data are collected for each condition (production and interpretation) and each item (specific instance of the task).
- Pragmatic Modeling: The Rational Speech Act (RSA) model is used as a probabilistic cognitive model to predict human behavior in these reference games. RSA generates condition-level predictions, which serve as a benchmark for evaluating LLM-derived predictions (a minimal RSA sketch follows this list).
- LLM Integration: The authors derive item-level scores from an instance of GPT-3.5 and variants of LLaMA2. These scores are then used to construct Bayesian statistical models.
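The sketch below shows the kind of condition-level predictions a vanilla RSA model makes for a toy reference game, via a literal listener, a pragmatic speaker, and a pragmatic listener. The lexicon, context, and parameter values are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a vanilla Rational Speech Act (RSA) model for a toy
# reference game; illustrative only, not the paper's implementation.
import numpy as np

# Rows: utterances, columns: referents; 1 = utterance is literally true of referent.
# Toy context: objects {blue square, blue circle, green square}.
semantics = np.array([
    [1, 1, 0],   # "blue"
    [1, 0, 1],   # "square"
    [0, 1, 0],   # "circle"
    [0, 0, 1],   # "green"
], dtype=float)
alpha = 1.0  # speaker rationality parameter (assumed value)

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

# Literal listener: P(referent | utterance), proportional to literal truth.
literal_listener = normalize(semantics, axis=1)

# Pragmatic speaker: P(utterance | referent), softmax of log literal-listener probabilities.
with np.errstate(divide="ignore"):
    utility = alpha * np.log(literal_listener)
speaker = normalize(np.exp(utility), axis=0)

# Pragmatic listener: P(referent | utterance), Bayes' rule over the speaker (uniform prior).
pragmatic_listener = normalize(speaker, axis=1)

print("speaker P(utterance | referent):\n", speaker.round(3))
print("listener P(referent | utterance):\n", pragmatic_listener.round(3))
```

In this toy context, the pragmatic listener assigns "blue" a higher probability of referring to the blue square than the blue circle, the classic pragmatic strengthening effect that RSA predicts at the condition level.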
Key Findings
- Item-Level Predictions: When item-level data are considered, LLM-derived scores predict variability across items that is not observed in the human data. The LLM-based statistical models imply item-level differences that the empirical data do not support, so their item-level predictions are not empirically adequate.
- Condition-Level Predictions: The authors explore three distinct methods for aggregating item-level scores into condition-level predictions: average-scores, average-probabilities, and average-WTA. Each method embodies different assumptions about how item-level information aggregates into condition-level predictions (a sketch of all three appears after this list).
- The average-WTA method, which leverages a "winner-takes-all" strategy commonly used in benchmark testing, provides the best predictive fit for aggregate human data despite disparities at the item level.
- The average-scores and average-probabilities methods fail to capture essential variance in the dataset, especially for interpretation tasks.
- Generalization across LLMs: The paper examines the generalization of these findings across different LLM backends, specifically various versions of LLaMA2. Their results indicate that while the average-WTA method is fairly robust across different LLMs for production tasks, it fails for interpretation tasks for several models, underscoring the need for model-specific adjustments.
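The following sketch illustrates how the three aggregation schemes differ, assuming each item supplies a vector of LLM log-probability scores over answer options. The scores, array shapes, and helper names are hypothetical and only meant to show the mechanics.

```python
# Minimal sketch (illustrative, not the paper's code) of three ways to aggregate
# item-level LLM scores into a single condition-level prediction. Assumes each
# item contributes a vector of log-probability scores, one per answer option.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical scores: 4 items x 3 answer options (e.g., LLM log-likelihoods).
item_scores = np.array([
    [-1.2, -2.5, -3.0],
    [-0.8, -1.9, -2.7],
    [-2.1, -1.1, -3.3],
    [-1.0, -2.2, -2.9],
])

# average-scores: average the raw scores across items, then map to probabilities.
avg_scores_pred = softmax(item_scores.mean(axis=0))

# average-probabilities: map each item's scores to probabilities, then average.
avg_probs_pred = softmax(item_scores, axis=1).mean(axis=0)

# average-WTA: take each item's single best option ("winner takes all"), then use
# the proportion of items won by each option as the condition-level prediction.
winners = item_scores.argmax(axis=1)
avg_wta_pred = np.bincount(winners, minlength=item_scores.shape[1]) / len(item_scores)

print("average-scores:       ", avg_scores_pred.round(3))
print("average-probabilities:", avg_probs_pred.round(3))
print("average-WTA:          ", avg_wta_pred.round(3))
```

The three methods can yield quite different condition-level distributions from the same item-level scores, which is why the choice of aggregation scheme matters for how well the resulting predictions fit aggregate human data.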
Implications
- Explanatory Power: The findings highlight a critical distinction between LLMs and probabilistic cognitive models. While LLMs can mimic human responses to some extent, their explanatory power is limited due to the item-specific nature of their predictions. Probabilistic models like RSA, grounded in more abstract, condition-level reasoning, offer more robust frameworks for generalizable human-like predictions.
- Methodological Robustness: The variability and inconsistency in performance measures, particularly the reliance on WTA strategies, suggest that standard benchmark testing may not adequately reflect the nuanced performance required for specific applications. Future research should account for various methods of score aggregation and their empirical validity.
- Practical Applications: For applications involving human judgment proxies, such as hybrid neuro-symbolic models or tasks demanding nuanced linguistic reasoning, the findings caution against over-reliance on raw LLM predictions without robust empirical validation. Ensuring each LLM component is rigorously tested against corresponding human data is paramount.
Future Directions
The paper opens several pathways for future research:
- Transferability Studies: Investigating the transferability of LLMs across different domains and tasks, emphasizing prompt strategies and model adjustments to enhance predictive robustness.
- Broader Dataset Analysis: Extending the methodological framework to more complex datasets and cognitive tasks, thereby increasing the generalizability of the findings.
- Human Variability Modeling: Delving deeper into subject-level variability in human data and comparing it with LLM predictions to uncover nuances and improve LLM-based neuro-symbolic models.
In conclusion, the paper by Franke, Tsvilodub, and Carcassi provides a rigorous and insightful analysis of integrating LLM-derived predictions into Bayesian statistical models. While highlighting the limitations of current LLMs in capturing human-like variability at the item level, it also uncovers potential pathways for refining statistical and cognitive modeling approaches using LLMs for more informed and reliable applications.