
Interpreting Learned Feedback Patterns in Large Language Models (2310.08164v5)

Published 12 Oct 2023 in cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) is widely used to train LLMs. However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term "Learned Feedback Pattern" (LFP) for patterns in an LLM's activations learned during RLHF that improve its performance on the fine-tuning task. We hypothesize that LLMs whose LFPs are accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF. To test this, we train probes to estimate the feedback signal implicit in the activations of a fine-tuned LLM. We then compare these estimates to the true feedback, measuring how accurately the LFPs reflect the fine-tuning feedback. Our probes are trained on a condensed, sparse, and interpretable representation of LLM activations, making it easier to correlate features of the input with our probe's predictions. We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs. Understanding LFPs can help minimize discrepancies between LLM behavior and training objectives, which is essential for the safety of LLMs.
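To make the probing setup concrete, below is a minimal sketch, not the authors' code: it assumes a PyTorch environment, a sparse-autoencoder-style encoder that maps dense activations to an overcomplete feature basis, and synthetic placeholder data standing in for the fine-tuned LLM's activations and the true RLHF feedback. All names, shapes, and hyperparameters are illustrative assumptions.

```python
# Sketch: train a linear probe on a sparse, interpretable encoding of LLM
# activations to estimate the feedback signal implicit in those activations.
# The encoder, data, and shapes are placeholders, not the paper's pipeline.
import torch
import torch.nn as nn

d_model, d_sparse, n_samples = 768, 4096, 10_000

# Assumed: a pre-trained sparse-autoencoder encoder (dense -> sparse features).
sae_encoder = nn.Sequential(nn.Linear(d_model, d_sparse), nn.ReLU())

# Placeholder data: activations from a fine-tuned LLM and the true feedback
# (e.g. a sentiment reward) each output would have received during RLHF.
activations = torch.randn(n_samples, d_model)
true_feedback = torch.randn(n_samples, 1)

with torch.no_grad():
    sparse_features = sae_encoder(activations)  # condensed, interpretable representation

# The probe: a single linear layer regressing the implicit feedback signal.
probe = nn.Linear(d_sparse, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(sparse_features), true_feedback)
    loss.backward()
    optimizer.step()

# How well the probe's estimates track the true feedback, i.e. how accurately
# the learned feedback patterns reflect the fine-tuning feedback.
with torch.no_grad():
    corr = torch.corrcoef(
        torch.stack([probe(sparse_features).squeeze(), true_feedback.squeeze()])
    )[0, 1]
print(f"probe/feedback correlation: {corr.item():.3f}")
```

In the paper's setting, the sparse encoder would be fit to real activations and the feedback would come from the RLHF reward model; the agreement between the probe's estimates and the true feedback is what quantifies how accurate the LFPs are.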

