
Coneheads: Hierarchy Aware Attention (2306.00392v2)

Published 1 Jun 2023 in cs.LG

Abstract: Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot-product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real-world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot-product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a hierarchy-aware similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot-product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.
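The abstract frames cone attention as a drop-in replacement for the dot-product score inside standard attention. The sketch below illustrates that idea in PyTorch: a generic attention routine with a pluggable pairwise similarity, shown once with ordinary scaled dot-product scores and once with an illustrative hierarchy-aware score. The function names and the `lca_depth_similarity` proxy (scoring a pair by how deep their midpoint lies in the Poincaré ball) are assumptions for exposition only; the paper's actual operator is defined via hyperbolic entailment cones and their lowest-common-ancestor depth, which is not reproduced here.

```python
# Minimal sketch of "similarity-pluggable" attention (PyTorch assumed).
# `lca_depth_similarity` is an illustrative proxy, NOT the paper's exact
# entailment-cone operator: it scores a pair by how deep their midpoint
# lies in the Poincare ball (far from the origin = deep shared ancestor,
# following the common convention that the root is embedded near the origin).
import torch
import torch.nn.functional as F


def attention(q, k, v, similarity):
    """Generic attention: softmax over pairwise similarity scores."""
    scores = similarity(q, k)            # (batch, n_q, n_k)
    weights = F.softmax(scores, dim=-1)  # normalize over keys
    return weights @ v                   # (batch, n_q, d_v)


def dot_product_similarity(q, k):
    """Standard scaled dot-product scores."""
    return (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5


def lca_depth_similarity(q, k, eps=1e-6):
    """Hypothetical hierarchy-aware score (illustration only)."""
    # Map q and k into the open unit ball so they live in the Poincare model;
    # tanh of the Euclidean norm keeps every point strictly inside the ball.
    q_b = torch.tanh(q.norm(dim=-1, keepdim=True)) * F.normalize(q, dim=-1)
    k_b = torch.tanh(k.norm(dim=-1, keepdim=True)) * F.normalize(k, dim=-1)
    # Pairwise Euclidean midpoints, shape (batch, n_q, n_k, d).
    mid = 0.5 * (q_b.unsqueeze(-2) + k_b.unsqueeze(-3))
    r = mid.norm(dim=-1).clamp(max=1.0 - eps)
    # d_H(0, x) = 2 * artanh(|x|): the score grows when the pair's midpoint
    # sits deep in the ball, i.e. when the two points share a deep "ancestor".
    return 2.0 * torch.atanh(r)


# Toy usage: swap the scoring function without touching the rest.
q, k, v = torch.randn(2, 5, 16), torch.randn(2, 7, 16), torch.randn(2, 7, 16)
out_dot = attention(q, k, v, dot_product_similarity)   # (2, 5, 16)
out_cone = attention(q, k, v, lca_depth_similarity)    # (2, 5, 16)
```

Because only the scoring function changes, the surrounding transformer machinery stays intact, which is what the abstract means by "drop-in replacement"; the paper reports that substituting true cone attention matches or improves on dot-product attention with fewer parameters.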
