Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
Abstract: LLMs drive current AI breakthroughs despite very little being known about their internal representations. In this work, we shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we derive in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protections by controlling the embeddings' intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometric features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of its inputs. We observe that these features suffice to solve toxicity detection, and even allow the identification of distinct types of toxicity. Our results demonstrate how, even at large scale, exact theoretical results can answer practical questions about LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
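The two geometric quantities named in the abstract can be illustrated with a minimal, self-contained sketch (NumPy only, toy data in place of actual LLM hidden states; this is an assumption-laden illustration, not the paper's implementation). The intrinsic dimension of a set of embeddings is approximated here by the numerical rank of the centered embedding matrix, and the per-region affine mapping of a ReLU MLP is identified by the sign pattern of its pre-activations, which indexes the region of the MLP's input-space partition that a given embedding falls in.

```python
import numpy as np

def intrinsic_dimension(embeddings, tol=1e-6):
    # Proxy for intrinsic dimension: numerical rank of the centered
    # embedding matrix (number of non-negligible singular values).
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def mlp_region_code(x, W, b):
    # Sign pattern of the ReLU pre-activations: a binary code that
    # identifies which affine region of the continuous piecewise-affine
    # MLP the input x lies in.
    return (W @ x + b > 0).astype(np.int8)

rng = np.random.default_rng(0)
# Toy "token embeddings": 64 points confined to a 5-dim subspace of R^32,
# standing in for MHA outputs whose intrinsic dimension is constrained.
Z = rng.normal(size=(64, 5)) @ rng.normal(size=(5, 32))
d = intrinsic_dimension(Z)   # recovers the planted dimension, 5

# Hypothetical single ReLU layer (random weights, for illustration only).
W = rng.normal(size=(16, 32))
b = rng.normal(size=16)
code = mlp_region_code(Z[0], W, b)  # length-16 binary region code
```

In the spirit of the paper, such quantities (intrinsic dimension per layer, region codes per input) would be computed on the actual hidden states of a pretrained LLM and fed to a simple classifier for tasks such as toxicity detection.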