Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models (2403.15226v2)

Published 22 Mar 2024 in cs.MM and cs.CL

Abstract: In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal LLMs (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. Besides, we also propose a novel propagation-of-information adapter (PIA) to serve the attention skipping of EAS and keep parameter efficiency, which can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference. For instance, LaVIN-EAS obtains 89.98% accuracy on ScienceQA while speeding up inference by 2.2 times compared to LaVIN.
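A minimal sketch of the idea described in the abstract: rank multi-head attention (MHA) layers by a redundancy score, skip the least important ones at inference, and route information through a lightweight adapter instead. This is not the authors' implementation; the module names, the scoring heuristic, and the adapter shape below are illustrative assumptions.

```python
# Illustrative sketch of attention skipping with a bottleneck adapter.
# Assumed names and heuristics; not the paper's exact EAS/PIA design.
import torch
import torch.nn as nn


class AdapterFFN(nn.Module):
    """Tiny bottleneck adapter standing in for the propagation-of-information
    adapter (PIA); its purely linear path is what would allow it to be folded
    (re-parameterized) into the following FFN for zero extra latency."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.down(x))


class SkippableBlock(nn.Module):
    """Transformer block whose attention can be bypassed at inference."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.adapter = AdapterFFN(dim)
        self.skip_attn = False  # set True for layers judged redundant

    def forward(self, x):
        if not self.skip_attn:
            h = self.norm1(x)
            a, _ = self.attn(h, h, h)
            x = x + a
        else:
            # Attention skipped: the adapter propagates information cheaply instead.
            x = self.adapter(x)
        return x + self.ffn(self.norm2(x))


def redundancy_score(block, x):
    """Illustrative proxy for redundancy: how much the MHA output actually
    changes the hidden state. A small relative change suggests the layer
    contributes little and is a candidate for skipping."""
    with torch.no_grad():
        h = block.norm1(x)
        a, _ = block.attn(h, h, h)
        return (a.norm() / (x.norm() + 1e-6)).item()


if __name__ == "__main__":
    dim, heads, n_layers, n_skip = 64, 4, 6, 2
    blocks = nn.ModuleList([SkippableBlock(dim, heads) for _ in range(n_layers)])
    x = torch.randn(2, 10, dim)  # (batch, tokens, dim) stand-in multimodal features

    # Score each layer's attention, then mark the n_skip least important ones for skipping.
    scores = [redundancy_score(b, x) for b in blocks]
    for i in sorted(range(n_layers), key=lambda i: scores[i])[:n_skip]:
        blocks[i].skip_attn = True

    for b in blocks:
        x = b(x)
    print("output shape:", x.shape)
```

In this sketch the skip decision is made once from a single batch of features; in practice the redundancy evaluation would be performed per downstream task before deployment, which is what makes the skipped layers a fixed, latency-free saving at inference time.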
