Challenges with unsupervised LLM knowledge discovery (2312.10029v2)

Published 15 Dec 2023 in cs.LG and cs.AI

Abstract: We show that existing unsupervised methods on LLM activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply to evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.

Summary

  • The paper shows that unsupervised methods like CCS can misinterpret general consistency patterns as genuine knowledge.
  • It shows experimentally that distractors and opinionated prompts lead the learned classifiers to predict non-factual features instead of truth.
  • The study calls for rigorous empirical frameworks to better isolate true latent knowledge from arbitrary model activations.

Introduction

LLMs are known for their impressive performance on textual tasks, which suggests they encode substantial information about the world. Tapping into this "knowledge" has proven difficult, however: while LLM outputs may contain accurate facts, they can also propagate misconceptions or produce strategically deceptive text. This has spurred efforts to elicit the latent knowledge of these models. Contrast-Consistent Search (CCS), introduced by Burns et al. (arXiv:2212.03827), aims to uncover knowledge within LLMs by exploiting the observation that knowledge satisfies a logical consistency structure. Yet doubts remain about whether such unsupervised methods can reliably identify a model's knowledge.

Probing Unsupervised Knowledge Detection

Unsupervised knowledge identification rests on the premise that genuine knowledge satisfies a consistency structure within the model's activations, and that this structure can be used to discover the knowledge. The paper challenges this assumption. It begins with a theoretical result proving that arbitrary binary features, not just knowledge, satisfy the consistency structure that CCS optimises for: any feature whose value flips between a statement and its negation can attain the optimal CCS loss. This suggests the method is not as precisely targeted at knowledge as previously thought.
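To make the theoretical target concrete, here is a minimal sketch of the CCS objective from Burns et al. (arXiv:2212.03827) in PyTorch. The probe architecture and tensor names (`phi_plus`, `phi_minus` for the activations of a statement and its negation) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe with a sigmoid, mapping an activation vector to p(x) in (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.probe = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.probe(x).squeeze(-1)

def ccs_loss(probe: CCSProbe, phi_plus: torch.Tensor, phi_minus: torch.Tensor) -> torch.Tensor:
    # phi_plus / phi_minus: (batch, dim) activations for each statement and its negation.
    p_plus, p_minus = probe(phi_plus), probe(phi_minus)
    # Consistency term: a statement and its negation should get complementary probabilities.
    consistency = (p_plus - (1.0 - p_minus)) ** 2
    # Confidence term: rules out the degenerate solution p_plus = p_minus = 0.5.
    confidence = torch.minimum(p_plus, p_minus) ** 2
    return (consistency + confidence).mean()
```

Note that nothing in this objective mentions truth: the loss only rewards complementary, confident probabilities on a contrast pair, a property that features other than knowledge can also satisfy, which is the crux of the paper's theoretical result.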

Experimental Analysis

The experimental side of the paper demonstrates that unsupervised methods end up classifying prominent features other than knowledge. When prompts contained distracting elements or explicit or implicit opinions, the learned probes often predicted those irrelevant features instead of factual content. These failures were not limited to one model or dataset, and variations in prompt design significantly affected the outcome, underscoring how sensitive current unsupervised methods are to factors unrelated to knowledge representation. Moreover, CCS produced predictions similar to those of simpler methods such as principal component analysis (PCA), casting doubt on whether its consistency structure contributes much to finding true knowledge.
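For intuition, the PCA-style baseline can be sketched as follows: take the top principal component of the contrast-pair activation differences and classify by its sign. This is a hedged reconstruction of that kind of baseline, not the authors' code; array shapes and names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_predictions(phi_plus: np.ndarray, phi_minus: np.ndarray) -> np.ndarray:
    """phi_plus, phi_minus: (n_examples, dim) activations for each statement
    and its negation. Returns a binary label per example (up to relabelling)."""
    diffs = phi_plus - phi_minus
    diffs = diffs - diffs.mean(axis=0)                  # centre before PCA
    direction = PCA(n_components=1).fit(diffs).components_[0]
    return (diffs @ direction > 0).astype(int)          # sign of the top component
```

That CCS so often agrees with this baseline suggests both pick up whatever direction is most salient in the contrast representations, whether or not it encodes truth.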

Future Directions and Conclusion

The shortcomings of current unsupervised methods highlight the need for rigorous sanity checks in future knowledge-elicitation research and, more broadly, for strategies that address the identification problem in LLM knowledge discovery. Future attempts at unsupervised knowledge detection are likely to hit the same stumbling blocks unless they can distinguish a model's knowledge from other prominent, consistency-satisfying features of its activations. The authors contribute such sanity checks and call for empirical frameworks that help isolate latent knowledge within LLMs.
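One such check, in the spirit of the distractor experiments described above, can be illustrated as a probe-robustness test: if appending an irrelevant distractor to each prompt flips a probe's predictions, the probe is tracking the distractor rather than knowledge. The interface below is a hypothetical sketch for illustration; `predict` and the precomputed activation arrays are placeholders, not an API from the paper.

```python
import numpy as np

def distractor_agreement(predict, phi_clean: np.ndarray, phi_distracted: np.ndarray) -> float:
    """Fraction of examples whose predicted label is unchanged when a distractor
    token is appended to the prompt before extracting activations."""
    clean = predict(phi_clean)             # binary labels on unmodified prompts
    distracted = predict(phi_distracted)   # labels on distractor-augmented prompts
    # Agreement far below 1.0 indicates the probe keys on the distractor
    # feature rather than on the truth of the statement.
    return float(np.mean(clean == distracted))
```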
