The Locality and Symmetry of Positional Encodings (2310.12864v1)

Published 19 Oct 2023 in cs.CL

Abstract: Positional Encodings (PEs) are used to inject word-order information into transformer-based language models. While they can significantly enhance the quality of sentence representations, their specific contribution to language models is not fully understood, especially given recent findings that various positional encodings are insensitive to word order. In this work, we conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style), which complements existing work in three aspects: (1) We uncover the core function of PEs by identifying two common properties, Locality and Symmetry; (2) We show that the two properties are closely correlated with the performance of downstream tasks; (3) We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly. We believe that these results are the basis for developing better PEs for transformer-based language models. The code is available at https://github.com/tigerchen52/locality_symmetry
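
The two properties named in the abstract, Locality and Symmetry, can be made concrete with a small sketch. The snippet below is not the paper's metric (the exact definitions and experiments live in the linked repository); it is a minimal illustration, assuming standard sinusoidal encodings and simple dot-product proxies. The helpers sinusoidal_pe, locality_score, and symmetry_score are hypothetical names introduced here: the first builds the classic Vaswani et al. encodings, the second asks whether similarity decays with position distance, and the third asks whether positions k steps ahead and k steps behind an anchor look alike.

import numpy as np

def sinusoidal_pe(n_pos, d_model):
    # Standard sinusoidal positional encodings (Vaswani et al., 2017).
    pos = np.arange(n_pos)[:, None]
    dim = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (dim // 2)) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def locality_score(pe):
    # Illustrative locality proxy: negated correlation between position
    # distance |i - j| and dot-product similarity PE(i) . PE(j).
    # Positive values mean nearby positions are more similar than distant ones.
    sim = pe @ pe.T
    ii, jj = np.triu_indices(len(pe), k=1)
    dist = np.abs(ii - jj)
    return -np.corrcoef(dist, sim[ii, jj])[0, 1]

def symmetry_score(pe, max_offset=32):
    # Illustrative symmetry proxy: 1 minus the mean relative gap between
    # similarities at offsets +k and -k around the same anchor position.
    sim = pe @ pe.T
    n = len(pe)
    gaps = []
    for i in range(max_offset, n - max_offset):
        fwd = sim[i, i + 1 : i + 1 + max_offset]   # offsets +1 .. +max_offset
        bwd = sim[i, i - max_offset : i][::-1]     # offsets -1 .. -max_offset
        gaps.append(np.abs(fwd - bwd).mean() / (np.abs(sim[i]).mean() + 1e-8))
    return 1.0 - float(np.mean(gaps))

pe = sinusoidal_pe(n_pos=128, d_model=64)
print(f"locality ~ {locality_score(pe):.3f}, symmetry ~ {symmetry_score(pe):.3f}")

For sinusoidal encodings the symmetry proxy is essentially 1, since the dot product depends only on |i - j|, and the locality proxy is positive because similarity tends to decay with distance. Learned absolute position embeddings, as used in BERT-style models, carry no such guarantee by construction, which is why the paper measures these properties empirically across pretrained models.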

Authors (3)
  1. Lihu Chen (12 papers)
  2. Gaël Varoquaux (87 papers)
  3. Fabian M. Suchanek (12 papers)

