Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation (2312.01648v2)
Abstract: Large language models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve various downstream tasks. To provide a practical and principled answer, we propose to characterize LLMs from a geometric perspective. We obtain in closed form (i) the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the per-layer feedforward networks. Our results are informative, do not rely on approximations, and are actionable. First, we show that, motivated by our geometric interpretation, we can bypass Llama2's RLHF by controlling its embedding's intrinsic dimension through informed prompt manipulation. Second, we derive 7 interpretable spline features that can be extracted from any (pre-trained) LLM layer, providing a rich abstract representation of their inputs. Those features alone (224 for Mistral-7B/Llama2-7B and 560 for Llama2-70B) are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the Jigsaw challenge, which aims at characterizing the type of toxicity of various prompts. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM.
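The feature counts follow from the per-layer construction: 7 features per layer gives 7 × 32 = 224 for the 32-layer Mistral-7B/Llama2-7B models and 7 × 80 = 560 for the 80-layer Llama2-70B. The sketch below only illustrates the overall recipe of extracting a small number of per-layer statistics from a frozen, pre-trained LLM and fitting a linear probe on them for toxicity detection. It is not the authors' implementation: the checkpoint name, the `layer_features` helper, and the proxy statistics (embedding norms and a crude rank estimate) are illustrative assumptions standing in for the 7 spline features derived in the paper; see the SplineLLM repository for the actual features.

```python
# Minimal sketch (not the paper's implementation): extract simple per-layer
# statistics from a pre-trained LLM's hidden states and fit a linear probe.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def layer_features(prompt: str) -> np.ndarray:
    """Return a few statistics per layer (proxies, not the paper's 7 spline features)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    feats = []
    for h in out.hidden_states[1:]:              # one tensor per layer: (1, T, D)
        h = h[0].float()                         # (T, D) token embeddings
        feats += [
            h.norm(dim=-1).mean().item(),        # mean token-embedding norm
            h.mean(dim=0).norm().item(),         # norm of the mean embedding
            torch.linalg.matrix_rank(h).item(),  # crude intrinsic-dimension proxy
        ]
    return np.array(feats)

# Toy usage: fit a linear probe on labeled prompts (0 = benign, 1 = toxic).
prompts = ["You are wonderful.", "I hate you and everyone like you."]
labels = [0, 1]
X = np.stack([layer_features(p) for p in prompts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```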
- Randall Balestriero
- Romain Cosentino
- Sarath Shekkizhar