Density Adaptive Attention is All You Need: Robust Parameter-Efficient Fine-Tuning Across Multiple Modalities (2401.11143v4)

Published 20 Jan 2024 in cs.LG, cs.AI, cs.CL, cs.CV, cs.SD, eess.AS, and eess.SP

Abstract: We propose the Multi-Head Density Adaptive Attention Mechanism (DAAM), a novel probabilistic attention framework that can be used for Parameter-Efficient Fine-tuning (PEFT), and the Density Adaptive Transformer (DAT), designed to enhance information aggregation across multiple modalities, including Speech, Text, and Vision. DAAM integrates learnable mean and variance into its attention mechanism, implemented in a multi-head framework, enabling it to collectively model any probability distribution for dynamic recalibration of feature significance. This method demonstrates significant improvements, especially with highly non-stationary data, surpassing the state-of-the-art attention techniques in model performance, up to approximately +20% (abs.) in accuracy. Empirically, DAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification, thereby establishing its robustness and versatility in handling data across multiple modalities. Furthermore, we introduce the Importance Factor, a new learning-based metric that enhances the explainability of models trained with DAAM-based methods.


Summary

  • The paper introduces DAAM, a multi-head attention mechanism that uses Gaussian modulation with learnable means and variances to dynamically recalibrate feature importance and enhance attention precision.
  • It integrates seamlessly with dot-product attention, enabling efficient fine-tuning and addressing non-stationarity across multiple data modalities.
  • The study presents the Importance Factor metric to boost model explainability by directly linking learned parameters with feature significance.

Introduction

The attention mechanism has become a cornerstone of modern Transformer models in natural language processing, speech signal processing, and digital image processing. Despite their ubiquity, traditional self-attention mechanisms in Transformer architectures face limitations, including inefficiencies with long-range dependencies and limited interpretability. Researchers have sought to enhance these mechanisms so that they better capture contextual significance within data sequences, an effort that has led to techniques such as Density Adaptive Attention.

Innovations in Attention Mechanisms

The introduction of the Multi-Head Density Adaptive Attention Mechanism (DAAM), implemented through the Density Adaptive Transformer (DAT), marks a notable shift in attention-based models. DAAM distinguishes itself by employing Gaussian modulation to recalibrate feature importance dynamically, allowing the attention mechanism to adapt flexibly to the input features. By learning both mean and variance parameters in a multi-head fashion, DAAM can collectively model a wide range of probability distributions, addressing core challenges such as non-stationarity in the data.
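
To make the idea concrete, the sketch below shows one plausible PyTorch realization of such a density-adaptive gate: each head keeps a learnable offset on the mean and a learnable scale on the variance of its features, and reweights sequence positions with the resulting Gaussian. The class name, parameterization, and normalization choices are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class DensityAdaptiveAttention(nn.Module):
    """Gates features with Gaussian weights whose mean and variance are learned per head."""

    def __init__(self, embed_dim: int, num_heads: int, eps: float = 1e-6):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.eps = eps
        # Learnable per-head adjustments to the batch statistics: a mean offset
        # and a log-scale on the variance.
        self.mean_offset = nn.Parameter(torch.zeros(num_heads, 1, 1, 1))
        self.log_var_scale = nn.Parameter(torch.zeros(num_heads, 1, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        b, t, _ = x.shape
        xh = x.view(b, t, self.num_heads, self.head_dim).permute(2, 0, 1, 3)
        # Per-head mean/variance over the sequence, shifted and scaled by the
        # learnable parameters.
        mu = xh.mean(dim=2, keepdim=True) + self.mean_offset
        var = xh.var(dim=2, keepdim=True) * self.log_var_scale.exp() + self.eps
        # Gaussian weight of each position's features, renormalized over the sequence.
        w = torch.exp(-0.5 * (xh - mu) ** 2 / var)
        w = w / (w.sum(dim=2, keepdim=True) + self.eps)
        out = xh * w
        # Reassemble heads: (num_heads, batch, seq_len, head_dim) -> (batch, seq_len, embed_dim)
        return out.permute(1, 2, 0, 3).reshape(b, t, -1)


# Example: the gate preserves the input shape while reweighting its features.
gate = DensityAdaptiveAttention(embed_dim=768, num_heads=8)
y = gate(torch.randn(2, 50, 768))
```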

Moreover, the paper highlights DAAM's adaptability and its ability to enhance existing dot-product attention frameworks. This pairing of DAAM's probabilistic gating with the precision of dot-product attention lays the groundwork for further optimization of attention mechanisms across data modalities.
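
As a rough illustration of the parameter-efficient fine-tuning use case, the sketch below freezes a pretrained backbone and trains only a density-adaptive gate (the class from the previous sketch) plus a small classification head. The backbone, feature shape, and mean pooling are placeholder assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn


class AdapterClassifier(nn.Module):
    """Frozen pretrained backbone + trainable density-adaptive gate + linear head."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False            # keep the pretrained weights frozen
        # Only the gate and the head below are updated during fine-tuning.
        self.gate = DensityAdaptiveAttention(embed_dim, num_heads)  # sketch from above
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)           # assumed shape: (batch, seq_len, embed_dim)
        gated = self.gate(feats)               # density-adaptive recalibration
        return self.head(gated.mean(dim=1))    # mean-pool over the sequence and classify
```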

Multimodality and Explainability

DAAM's robustness extends across speech, text, and vision modalities, a testament to its versatile design. It particularly excels where data exhibit high non-stationarity, since it can adaptively discern feature significance in rapidly changing contexts. Confirming this broad applicability, DAAM demonstrates substantial improvements in tasks such as speech emotion recognition and image and text classification.

Alongside the performance gains, the mechanism contributes to explainability, a critical factor in the acceptance of and trust in AI models. The Importance Factor (IF), a new metric proposed with DAAM, gives users a window into the model's decision-making process by tying feature significance directly to DAAM's learned parameters.
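
Purely as an illustration of reading importance off learned parameters, one could score each head of the gate sketched earlier by the magnitude of its learned Gaussian parameters, normalized across heads. This is a hypothetical formulation for intuition only, not the paper's definition of the Importance Factor.

```python
import torch


def importance_factor(gate: DensityAdaptiveAttention) -> torch.Tensor:
    """One score per head, summing to 1; larger values suggest more influential heads."""
    # Combine the magnitudes of the learned mean offset and variance scale per head.
    scores = gate.mean_offset.abs().flatten() + gate.log_var_scale.exp().flatten()
    return scores / scores.sum()
```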

Performance and Practicality

The results described in the paper place DAAM among the strongest current attention mechanisms. The paper reports extensive experiments demonstrating DAAM's advantages across multiple modalities, which broadens its range of potential real-world applications. Its ability to work alongside pre-existing Transformer models, enriching them with contextually adaptive attention, also promises further gains in model performance.

While DAAM's effectiveness is clear, a careful examination of its behavior across the layers of encoder models is warranted. Layers with high Importance Factor scores consistently align with superior model performance, suggesting that DAAM is not merely a bolt-on modification but an integral part of the evolution of attention-based modeling.

Final Thoughts

Density Adaptive Attention represents a significant step toward models that are as dynamic and nuanced as the contexts they aim to interpret. The DAAM framework, realized in the Density Adaptive Transformer, improves the effectiveness, efficiency, and explainability of attention mechanisms, yielding models that respond with greater precision and adaptability to multimodal data.
