
Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation (2312.01648v2)

Published 4 Dec 2023 in cs.AI, cs.CL, and cs.LG

Abstract: Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve various downstream tasks. To provide a practical and principled answer, we propose to characterize LLMs from a geometric perspective. We obtain in closed form (i) the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the per-layer feedforward networks. Our results are informative, do not rely on approximations, and are actionable. First, we show that, motivated by our geometric interpretation, we can bypass Llama2's RLHF by controlling its embedding's intrinsic dimension through informed prompt manipulation. Second, we derive 7 interpretable spline features that can be extracted from any (pre-trained) LLM layer, providing a rich abstract representation of their inputs. Those features alone (224 for Mistral-7B/Llama2-7B and 560 for Llama2-70B) are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the Jigsaw challenge, which aims at characterizing the type of toxicity of various prompts. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM.

Authors (3)
  1. Randall Balestriero (91 papers)
  2. Romain Cosentino (12 papers)
  3. Sarath Shekkizhar (13 papers)
Citations (1)

Summary

Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation

The paper "Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation" provides a geometric analysis of LLMs, focusing on how these models represent data internally and how that understanding can be leveraged in practical applications such as toxicity detection. The authors present a detailed analysis of the geometric structure of LLM computations, introducing new insights and practical solutions for downstream tasks that are increasingly relevant in human-computer interaction scenarios.

Geometric Insights into LLMs

At the core of this research lies the characterization of LLMs from a geometric perspective. In particular, the paper examines the Multi-Head Attention (MHA) mechanism and the feedforward networks within LLM layers. The authors derive exact formulations for the intrinsic dimension of MHA embeddings, showing that these embeddings reside within convex hulls defined by the token sequences. The intrinsic dimension is critical because it dictates the expressiveness and capacity of the model, and it can be manipulated through prompt engineering to control downstream generation tasks, including bypassing reinforcement learning from human feedback (RLHF).
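The convex-hull constraint can be illustrated numerically. In the toy single-head attention below (hypothetical sizes, not the paper's derivation), each softmax row is nonnegative and sums to one, so every output row is a convex combination of the value rows; the output's rank, and hence its intrinsic dimension, is bounded by the number of tokens rather than the model width.

```python
import numpy as np

# Toy single-head attention with hypothetical sizes: 5 tokens, width 64.
rng = np.random.default_rng(0)
T, d = 5, 64
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)   # softmax rows: nonnegative, sum to 1

out = A @ V                         # each row: convex combination of V's rows

# The output rank cannot exceed the number of tokens T (here T=5 << d=64).
print(np.linalg.matrix_rank(out))
```

Adding tokens to the prompt loosens this bound, which is the lever the paper exploits.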

Intrinsic Dimension and Toxicity Generation

The work demonstrates how the intrinsic dimension, determined by the input-token interdependencies captured in the attention matrices, bounds the model's output expressivity. A key finding is that informed prompt manipulation can increase this dimension, exposing regions of the model's latent space that RLHF never adjusted. The authors show that feeding an LLM additional related sentences raises the intrinsic dimension and can breach RLHF protections, leading to toxic content generation and highlighting a vulnerability in current RLHF practice.
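The paper derives the intrinsic dimension in closed form; as a rough numerical illustration only, one common proxy is the entropy-based effective rank of the embedding matrix's singular-value spectrum. The sketch below (synthetic Gaussian matrices standing in for token embeddings) shows that a prompt with more tokens admits a higher effective rank, mirroring the effect of appending related sentences.

```python
import numpy as np

def effective_rank(X: np.ndarray) -> float:
    """Entropy-based effective rank of X's singular-value spectrum."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()                 # normalize spectrum to a distribution
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(1)
short_prompt = rng.normal(size=(4, 32))    # few tokens: rank bounded by 4
long_prompt = rng.normal(size=(40, 32))    # more tokens: higher rank possible

print(effective_rank(short_prompt), effective_rank(long_prompt))
```

This proxy is not the paper's closed-form quantity; it only makes the dimension-vs-token-count relationship concrete.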

Spline Operators and Feature Engineering

The authors further explore MLP layers in LLMs by leveraging spline theory to express feedforward computations as Continuous Piecewise Affine (CPA) mappings. This perspective provides a foundational understanding of how LLMs partition the input space and allows for the derivation of seven spline features that effectively capture the geometrical properties of prompt representations. These features encapsulate the model’s response to input across layers, proving robust in discriminating between different input domains and effectively detecting text toxicity.
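Two spline-type statistics for a single ReLU layer can be sketched as follows (the paper derives seven features with their own definitions; this is only an analogous illustration). Because a ReLU network is continuous piecewise affine, the sign pattern of the pre-activations identifies the partition region containing the input, and the scaled pre-activation magnitudes give the distances to the region's boundary hyperplanes.

```python
import numpy as np

# Hypothetical ReLU layer mapping 8 -> 16 dimensions.
rng = np.random.default_rng(2)
W = rng.normal(size=(16, 8))
b = rng.normal(size=16)
x = rng.normal(size=8)

pre = W @ x + b
region_code = (pre > 0).astype(int)            # which affine region x lies in
dists = np.abs(pre) / np.linalg.norm(W, axis=1)  # distance to each hyperplane
nearest_boundary = dists.min()                 # a scalar "spline" feature

print(region_code, nearest_boundary)
```

Collecting a handful of such scalars per layer and concatenating across layers yields a compact prompt representation of the kind the paper feeds to simple classifiers.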

Practical Implications and Evaluation

Through extensive empirical validation, the authors demonstrate the utility of the geometric features in tasks such as toxicity detection. The derived features yield competitive performance in separating toxic from non-toxic prompts, outperforming commercially available solutions in both efficiency and accuracy. Notably, these features, combined with simple linear models, achieve high ROC-AUC scores on standard benchmarks such as Jigsaw's toxic comment classification challenge, covering a broad range of toxicity types without additional data augmentation or model fine-tuning.
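The probing setup amounts to fitting a linear classifier on the concatenated spline features (e.g., 224 dimensions for a 7B model). The sketch below uses synthetic separable data in place of real extracted features and a ridge-regularized least-squares probe rather than the paper's exact classifier, purely to show the shape of the pipeline.

```python
import numpy as np

# Synthetic stand-in: 200 prompts, 224 spline features each, binary labels.
rng = np.random.default_rng(3)
n, d = 200, 224
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)   # hypothetical toxic / non-toxic labels

# Ridge-regularized least-squares linear probe (closed-form solution).
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
acc = ((X @ w > 0.5) == y).mean()
print(acc)
```

In practice one would report ROC-AUC on held-out data; the point here is only that a linear model on a few hundred features suffices as the classifier.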

Conclusions and Future Directions

This paper contributes significantly to AI interpretability and its applications, providing both theoretical insights and practical tools for toxicity detection and model understanding. The ability to manipulate the expressivity of LLMs through intrinsic geometric features opens avenues for safer AI applications and highlights areas for future research. Building robust, comprehensive safety mechanisms against toxicity will require geometric interpretations of model structure that go beyond RLHF alone.

Future work may involve extending these geometric analyses to larger and more complex models, potentially uncovering novel features that further enhance model transparency and control. Additionally, integrating this understanding with diverse datasets can help generalize these findings, improving the adaptability and robustness of models in varied real-world scenarios.
