Addressing Token Uniformity in Transformers via Singular Value Transformation (2208.11790v2)

Published 24 Aug 2022 in cs.CL

Abstract: Token uniformity is commonly observed in transformer-based models: different tokens come to share a large proportion of similar information after passing through multiple stacked self-attention layers. In this paper, we propose to use the distribution of singular values of each transformer layer's outputs to characterise the phenomenon of token uniformity, and we empirically show that a less skewed singular value distribution can alleviate the token uniformity problem. Based on these observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, besides alleviating token uniformity, the transformation function should also preserve the local neighbourhood structure of the original embedding space. The proposed singular value transformation is applied to a range of transformer-based language models, including BERT, ALBERT, RoBERTa and DistilBERT, and yields improved performance on semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenUni.git.
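
To make the general idea concrete, the sketch below applies a concave, monotone map to the singular values of one layer's token-embedding matrix so that the spectrum becomes less skewed, then reconstructs the embeddings. This is a minimal illustration under assumptions, not the authors' released method: the power map with exponent `gamma` and the rescaling to the original top singular value are illustrative stand-ins for the transformation function defined in the paper.

```python
# Illustrative sketch (not the paper's exact transformation): flatten the
# singular value spectrum of a transformer layer's output.
import torch

def flatten_singular_values(hidden_states: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """hidden_states: [seq_len, hidden_dim] output of one transformer layer."""
    # Decompose the token-by-dimension matrix.
    U, S, Vh = torch.linalg.svd(hidden_states, full_matrices=False)
    # Concave, monotone map (assumed for illustration): large singular values
    # are compressed more than small ones, so the spectrum becomes less skewed.
    S_new = S.pow(gamma)
    # Rescale so the largest singular value is unchanged, keeping the overall
    # scale of the embeddings comparable while reshaping the distribution.
    S_new = S_new * (S.max() / S_new.max())
    # Reassemble the embeddings with the transformed spectrum.
    return (U * S_new) @ Vh

# Example: transform a random stand-in for a 32-token layer output.
x = torch.randn(32, 768)
x_flat = flatten_singular_values(x)
```

Because the map is monotone and only reshapes the spectrum, relative relationships among tokens are largely preserved, which is in the spirit of the paper's requirement that the transformation keep the local neighbourhood structure of the original embedding space.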
