- The paper introduces propositional probes to decode latent world states in language models, enhancing interpretability under adversarial conditions.
- It employs a Hessian-based algorithm to identify a binding subspace that enables extraction of logical propositions from internal activations.
- Empirical tests confirm high exact-match accuracy and robust performance across synthetic, paraphrased, and multilingual datasets.
Monitoring Latent World States in LLMs with Propositional Probes
The paper "Monitoring Latent World States in LLMs with Propositional Probes" by Jiahai Feng, Stuart Russell, and Jacob Steinhardt presents an approach to improving the interpretability and faithfulness of language models (LMs). The authors introduce propositional probes as a means to extract and monitor latent world states within LMs, potentially providing a faithful readout of the input context even when the LM's generated output is not.
The core hypothesis underpinning the research is that LMs encapsulate their input contexts within a latent world model, which can be probed and decoded into logical propositions. Propositional probes aim to extract these world states by interpreting the internal activations of the LM. For instance, given a context like "Greg is a nurse. Laura is a physicist," the approach extracts the propositions WorksAs(Greg, nurse) and WorksAs(Laura, physicist) from the model's activations.
The researchers developed this method by identifying a "binding subspace" within the activation space, in which the activations of bound tokens (an entity and its attribute) exhibit high similarity. The binding subspace was determined using a novel Hessian-based algorithm. They trained simple linear domain probes for four lexical domains (names, countries, occupations, and foods) and constructed propositions by pairing the probes' outputs via a binding-similarity metric derived from the binding subspace.
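As a rough illustration (not the authors' code), a domain probe is essentially a linear classifier over activations. The sketch below trains a softmax probe on toy activation vectors whose class structure stands in for a lexical domain such as occupations; the data generation and hyperparameters are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_classes, n = 16, 3, 300   # toy hidden size, lexical-domain size, samples

# Toy "activations": each class (e.g. occupation) has a mean direction plus noise.
means = rng.normal(size=(n_classes, d))
y = rng.integers(0, n_classes, size=n)
X = means[y] + 0.3 * rng.normal(size=(n, d))

# Train a linear (softmax) domain probe with plain gradient descent.
W = np.zeros((d, n_classes))
for _ in range(500):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0            # gradient of mean cross-entropy
    W -= 0.1 * X.T @ p / n

acc = (np.argmax(X @ W, axis=1) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

On well-separated toy data like this, a linear probe reaches near-perfect training accuracy, which mirrors the paper's observation that simple linear probes suffice for lexical decoding.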
An important validation in the paper is that, despite being trained on synthetic templated contexts, the propositional probes generalized effectively to more complex datasets, including paraphrased contexts and translations into Spanish.
Methodology
Probing Internal Activations:
The process begins with the creation of synthetic contexts populated with random propositions about entities (e.g., names, countries, occupations, foods). Probes trained on these contexts extract domain-specific lexical information. The critical step involves interpreting these domain probes to form logical propositions by leveraging a binding similarity metric.
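The pairing step can be sketched as follows. This is a hedged toy example, not the paper's implementation: the subspace basis, activation construction, and noise scale are all invented, and the point is only to show how projecting onto a binding subspace and taking the highest-similarity match could assemble entity-attribute pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4                 # toy hidden size and binding-subspace rank

# Hypothetical binding subspace: columns of B form an orthonormal basis.
B, _ = np.linalg.qr(rng.normal(size=(d, k)))
P = B @ B.T                  # projector onto the binding subspace

def binding_similarity(h_entity, h_attr):
    """Similarity of two activations after projecting onto the binding subspace."""
    return float((P @ h_entity) @ (P @ h_attr))

# Toy activations: entity i and attribute i share a binding direction b_i.
b0, b1 = B[:, 0], B[:, 1]
entities = np.stack([b0, b1]) + 0.05 * rng.normal(size=(2, d))
attrs = np.stack([b0, b1]) + 0.05 * rng.normal(size=(2, d))

# Pair each entity with its most similar attribute to form propositions.
sims = np.array([[binding_similarity(e, a) for a in attrs] for e in entities])
pairs = sims.argmax(axis=1)
print(pairs)  # entity 0 pairs with attribute 0, entity 1 with attribute 1
```

The greedy argmax here stands in for whatever matching rule is used; the essential idea is that similarity is computed only within the binding subspace, so unrelated activation content does not interfere.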
Hessian-based Algorithm for Binding Subspace:
The binding subspace was identified using second-order derivatives of a binding-strength measure between entity and attribute representations. The algorithm evaluates binding strength by perturbing unbound activations and measuring the resulting change in model behavior, specifically the model's output probabilities for queries about the entities and attributes.
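The idea of recovering a subspace from second-order structure can be sketched with a toy quadratic. This is an assumption-laden stand-in, not the paper's algorithm: here the "binding score" is an invented function whose curvature lives only in the first two coordinates, and the sketch shows how the top eigenvectors of its Hessian recover that subspace:

```python
import numpy as np

d = 8
# Hypothetical ground truth: perturbations affect binding only through the
# first two coordinates (the "true" binding subspace in this toy).
W = np.diag([2.0, 1.0] + [0.0] * (d - 2))

def binding_score(v):
    """Toy quadratic stand-in for the change in the model's answer probability
    when unbound activations are perturbed along direction v."""
    return v @ W @ v / 2

# Finite-difference Hessian of the binding score at v = 0.
eps, I = 1e-3, np.eye(d)
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H[i, j] = (binding_score(eps * (I[i] + I[j])) - binding_score(eps * I[i])
                   - binding_score(eps * I[j]) + binding_score(0 * I[i])) / eps**2

# Top eigenvectors of the Hessian span the recovered binding subspace.
eigvals, eigvecs = np.linalg.eigh(H)
subspace = eigvecs[:, np.argsort(eigvals)[-2:]]
print(np.abs(subspace).round(2))  # mass concentrated on coordinates 0 and 1
```

In practice the derivatives would be taken through the model with automatic differentiation rather than finite differences; the toy only illustrates why high-curvature directions of a binding measure define a subspace.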
Validation and Generalizability:
Propositional probes were tested on various datasets, demonstrating robust generalization from the synthetic training data to more complex contexts. For example, rewriting templated data into short stories or translating them into Spanish did not significantly degrade the accuracy of propositions decoded by the probes.
Results
Quantitative evaluations showed high accuracy for propositional probes across datasets. Metrics included exact-match accuracy and the Jaccard index, which measure how closely the decoded proposition set matches the ground-truth set. The propositional probes performed comparably to a prompting baseline that queried the LM iteratively about the context.
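Both metrics are standard set comparisons; a minimal sketch, using invented example propositions:

```python
def exact_match(pred, gold):
    """1 if the decoded proposition set matches the ground truth exactly, else 0."""
    return int(set(pred) == set(gold))

def jaccard(pred, gold):
    """Intersection-over-union of decoded vs. ground-truth proposition sets."""
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if p | g else 1.0

# Hypothetical example: one of two decoded propositions is wrong.
gold = [("WorksAs", "Greg", "nurse"), ("WorksAs", "Laura", "physicist")]
pred = [("WorksAs", "Greg", "nurse"), ("WorksAs", "Laura", "chef")]
print(exact_match(pred, gold), round(jaccard(pred, gold), 2))  # 0 0.33
```

Exact match is all-or-nothing, while the Jaccard index gives partial credit, which is why the paper reports both.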
Practical Applications:
- Prompt Injection: Under prompt injection attacks, the propositional probes maintained high fidelity, while the accuracy of standard prompting degraded.
- Backdoor Attacks: Propositional probes also remained accurate under backdoor attacks, in which the model is finetuned to produce incorrect outputs on inputs containing a specific trigger.
- Gender Bias: Propositional probes exhibited significantly less bias in decoding the gender of professions compared to standard prompting methods, indicating their potential to mitigate biases inherent in LMs.
Implications and Future Work
The theoretical and practical implications of this research are substantial. It underscores the importance of improving LM interpretability and faithfulness through better understanding and extraction of their internal latent states. Practically, the effectiveness of propositional probes in maintaining fidelity under adversarial conditions highlights their potential utility as monitoring tools during LM deployment.
Future work could involve scaling the propositional probing technique to more complex and less structured domains, further exploring the binding mechanisms in LMs, and enhancing the probes to cover broader aspects of semantic representation, such as role-filler bindings and state changes.
In sum, the paper makes a critical contribution to the field of AI, particularly in the interpretability and robustness of LMs, laying groundwork for safer and more reliable use of these models in real-world applications.