
Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI (2312.07886v1)

Published 13 Dec 2023 in cs.AI and cs.CL

Abstract: LLMs are capable of reasoning over diverse input data modalities through pre-trained encoders. However, the growing diversity of input data modalities prevents incorporating all modalities into LLMs, especially when LLMs are deployed on resource-constrained edge devices for embodied AI applications. Instead, a better option is to adaptively involve only the useful modalities at runtime, depending on the current environmental contexts and task requirements. For such modality adaptation, existing work adopts fixed connections between encoders and the LLM's input layer, leading to high training cost at runtime and ineffective cross-modal interaction. In this paper, we address these limitations by presenting mPnP-LLM, a new technique that allows fully elastic, automated and prompt runtime modality adaptation, by connecting unimodal encoders to a flexible set of last LLM blocks and making such latent connections fully trainable at runtime. Experiments over the nuScenes-QA dataset show that mPnP-LLM can achieve up to 3.7x FLOPs reduction and 30% GPU memory usage reduction, while retaining on-par accuracy with the existing schemes. Under the same compute budget, mPnP-LLM improves the task accuracy by up to 4% compared to the best existing scheme.
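To make the mechanism described in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of the general idea: the LLM blocks and unimodal encoders stay frozen, while small trainable connectors project encoder tokens into the last K decoder blocks, and the set of active modalities can be changed at runtime. All names (ModalityConnector, ElasticMultimodalLLM), the tanh gating, and the mean-pooled injection are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of elastic late-block modality connections (PyTorch).
# Assumptions: frozen LLM blocks, frozen encoders, trainable connectors only.
import torch
import torch.nn as nn


class ModalityConnector(nn.Module):
    """Trainable latent connection from one unimodal encoder to one LLM block."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)
        # Learnable gate; starts at zero so the injection is initially neutral.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, enc_tokens: torch.Tensor) -> torch.Tensor:
        # enc_tokens: (batch, n_tokens, enc_dim) -> (batch, n_tokens, llm_dim)
        return torch.tanh(self.gate) * self.proj(enc_tokens)


class ElasticMultimodalLLM(nn.Module):
    """Frozen decoder blocks plus trainable connectors on the last `k_last` blocks."""

    def __init__(self, llm_blocks: nn.ModuleList, llm_dim: int,
                 encoder_dims: dict, k_last: int = 2):
        super().__init__()
        self.blocks = llm_blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)  # the LLM itself is never updated
        self.k_last = k_last
        # One connector per (modality, late block) pair; only these are trained.
        self.connectors = nn.ModuleDict({
            name: nn.ModuleList(ModalityConnector(dim, llm_dim)
                                for _ in range(k_last))
            for name, dim in encoder_dims.items()
        })

    def forward(self, hidden: torch.Tensor, modality_tokens: dict,
                active: set) -> torch.Tensor:
        n = len(self.blocks)
        for i, block in enumerate(self.blocks):
            hidden = block(hidden)
            j = i - (n - self.k_last)  # index among the last k_last blocks
            if j >= 0:
                # Inject only the modalities selected for the current context.
                for name in active:
                    feats = self.connectors[name][j](modality_tokens[name])
                    hidden = hidden + feats.mean(dim=1, keepdim=True)
        return hidden


# Toy usage with stand-in blocks (real blocks would be frozen transformer layers).
blocks = nn.ModuleList(nn.Linear(512, 512) for _ in range(4))
model = ElasticMultimodalLLM(blocks, llm_dim=512,
                             encoder_dims={"camera": 768, "lidar": 256}, k_last=2)
h = torch.randn(1, 16, 512)
tokens = {"camera": torch.randn(1, 8, 768), "lidar": torch.randn(1, 8, 256)}
out = model(h, tokens, active={"camera"})  # e.g. drop lidar at runtime
```

Under this reading, runtime modality adaptation amounts to choosing which connectors are active and training only them, which is why the cost scales with the connectors rather than with the full model.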
