- The paper introduces Contrast-Consistent Search (CCS) to extract latent truth from internal activations without human supervision.
- It outperforms zero-shot accuracy by 4% on average and maintains performance even under misleading prompts.
- The method generalizes across tasks, highlighting its potential for reliable applications in critical domains like medicine and law.
Discovering Latent Knowledge in LLMs
Introduction
LLMs like GPT-3 and BERT are widely used in applications such as chatbots, machine translation, and sentiment analysis. However, these models can generate text that is not truthful. The issue arises because their training objectives are not aligned with truthfulness: a model trained to imitate human text may reproduce common misconceptions, and a model optimized for engagement may produce compelling but false information.
The paper proposes a method that addresses this problem by searching for latent knowledge directly in the internal activations of an LLM, rather than relying on model outputs or human supervision. The approach, called Contrast-Consistent Search (CCS), answers yes-no questions by identifying a direction in activation space that satisfies logical consistency properties.
Problem Statement and Framework
Discovering Latent Knowledge
The central problem tackled by the paper is to answer yes-no questions using only the internal hidden representations (activations) of a pre-trained LLM, without relying on model outputs or external supervision. The goal is to determine whether the internal activations of these models contain usable knowledge about the truth of certain statements.
Method: Contrast-Consistent Search (CCS)
CCS works by finding a direction in the model's activation space along which true and false statements separate, via a linear probe on the hidden activations. Here's how it works, step by step (a code sketch follows the list):
- Construct Contrast Pairs: For each yes-no question, form two natural-language statements, one answering "Yes" and the other "No." Exactly one of the two is true.
- Feature Extraction and Normalization: Extract the model's hidden activations for both statements, then normalize the "Yes" and "No" sets independently (subtract the mean, divide by the standard deviation) so the probe cannot simply key on which answer word appears.
- Mapping Activations to Probabilities: Learn a probe, a linear map followed by a sigmoid, that turns each normalized activation into the probability that the corresponding statement is true.
- Optimization: Train the probe with an unsupervised loss combining a consistency term (the probabilities assigned to a statement and its negation should sum to 1) and a confidence term (the probabilities should be pushed away from the uninformative 0.5 solution).
- Inference: To answer a new question, average the probability that the "Yes" statement is true with the probability that the "No" statement is false, and answer "Yes" when the average exceeds 0.5.
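Below is a minimal sketch of these steps in PyTorch. It illustrates the probe, loss, and inference rule described above rather than reproducing the authors' implementation: random tensors stand in for the hidden activations that would normally be extracted from a frozen LLM, and a single Adam run replaces the multiple random restarts used in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder activations: in practice these would be hidden states from a
# frozen LLM for the "Yes" and "No" statement of each contrast pair.
n_examples, hidden_dim = 256, 768
acts_yes = torch.randn(n_examples, hidden_dim)
acts_no = torch.randn(n_examples, hidden_dim)

def normalize(x):
    # Normalize each set independently so the probe cannot simply detect
    # which answer word ("Yes" vs. "No") a statement ends with.
    return (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)

acts_yes, acts_no = normalize(acts_yes), normalize(acts_no)

# Probe: linear map + sigmoid, giving the probability that a statement is true.
probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_yes = probe(acts_yes).squeeze(-1)  # P("Yes" statement is true)
    p_no = probe(acts_no).squeeze(-1)    # P("No" statement is true)

    # Consistency: a statement and its negation should have probabilities summing to 1.
    consistency = ((p_yes - (1 - p_no)) ** 2).mean()
    # Confidence: penalize the degenerate solution p_yes = p_no = 0.5.
    confidence = (torch.minimum(p_yes, p_no) ** 2).mean()

    loss = consistency + confidence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference: average P("Yes" is true) and P("No" is false); answer "Yes" above 0.5.
with torch.no_grad():
    score = 0.5 * (probe(acts_yes).squeeze(-1) + (1 - probe(acts_no).squeeze(-1)))
    answers = score > 0.5
```

Note that assigning 0.5 to every statement satisfies the consistency term perfectly; it is the confidence term that rules out this trivial solution, since a consistent and confident probe incurs neither penalty.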
Experimental Results
CCS was evaluated on six different LLMs (like T5, UnifiedQA, and GPT-J) using 10 diverse datasets covering tasks such as sentiment classification and natural language inference. Here are some notable results and findings:
- Performance: CCS outperformed the zero-shot accuracy of these models by an average of 4%. For instance, on UnifiedQA, the average zero-shot accuracy was 80.4%, while CCS achieved 82.1%.
- Robustness: CCS was less sensitive to prompt wording and maintained high accuracy even when models were deliberately misled with incorrect prompt prefixes. For example, on UnifiedQA, misleading prompts caused a 9.5% drop in zero-shot accuracy, while CCS's accuracy remained essentially unchanged.
- Transferability: CCS exhibited strong performance even when transferred across different tasks, indicating that CCS might be discovering a task-agnostic representation of truth within the models.
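To make the transfer setup concrete, a probe trained on one task's contrast pairs can be evaluated unchanged on another task's pairs. The snippet below continues the sketch above (reusing `probe`, `normalize`, and `hidden_dim`); the second task's activations and labels are hypothetical placeholders.

```python
# Continues the CCS sketch above. acts_yes_b / acts_no_b / labels_b are
# placeholders for a second task's contrast-pair activations and gold labels.
acts_yes_b = normalize(torch.randn(128, hidden_dim))
acts_no_b = normalize(torch.randn(128, hidden_dim))
labels_b = torch.randint(0, 2, (128,)).bool()

with torch.no_grad():
    score_b = 0.5 * (probe(acts_yes_b).squeeze(-1) + (1 - probe(acts_no_b).squeeze(-1)))
    preds_b = score_b > 0.5
    acc = (preds_b == labels_b).float().mean()
    # Unsupervised training leaves the probe's sign ambiguous, so accuracy is
    # conventionally reported as max(acc, 1 - acc).
    transfer_acc = torch.maximum(acc, 1 - acc)
```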
Implications
Practical Implications
With the ability to uncover latent truth in LLMs, CCS could be crucial for applications where the accuracy and truthfulness of model outputs are paramount. This includes areas like medical diagnosis, legal document analysis, and automated fact-checking, where incorrect information can have significant consequences.
Theoretical Implications
The results suggest that pretrained LLMs might inherently develop internal representations that align with truth, even though their outputs might sometimes be false. This could open new research avenues in understanding how these internal representations develop and how they can be leveraged for various tasks.
Future Directions
The framework presented can be expanded and refined in numerous ways:
- Additional Consistency Constraints: Incorporate logical constraints beyond negation consistency to further improve the method's ability to detect truth.
- Generalization Beyond Yes-No Questions: Extend the method to more complex types of questions and statements.
- Calibration and Robustness: Improve the calibration of the probabilistic outputs and further harden the method against adversarial prompts.
Conclusion
Contrast-Consistent Search (CCS) provides an innovative way to discover latent knowledge within LLMs without relying on model outputs or external supervision. The strong empirical results demonstrate that this method can exceed zero-shot performance, maintain robustness against misleading information, and find task-agnostic truth representations in model activations. This approach opens the door to new ways of ensuring the reliability and truthfulness of AI systems without constant human oversight.