
Cloud-Device Collaborative Learning for Multimodal Large Language Models (2312.16279v1)

Published 26 Dec 2023 in cs.CV

Abstract: The burgeoning field of Multimodal Large Language Models (MLLMs) has exhibited remarkable performance in diverse tasks such as captioning, commonsense reasoning, and visual scene understanding. However, the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters, leading to a notable decline in generalization capabilities when these models are compressed for device deployment. Addressing this challenge, we introduce a Cloud-Device Collaborative Continual Adaptation framework, designed to enhance the performance of compressed, device-deployed MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment. In the uplink phase, we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively filter out-of-distribution tokens, thereby reducing transmission costs and improving training efficiency. On the cloud side, we propose an Adapter-based Knowledge Distillation (AKD) method to transfer refined knowledge from large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic Weight update Compression (DWC) strategy for the downlink, which adaptively selects and quantizes updated weight parameters, enhancing transmission efficiency and reducing the representational disparity between cloud and device models. Extensive experiments on several multimodal benchmarks demonstrate the superiority of our proposed framework over prior Knowledge Distillation and device-cloud collaboration methods. Notably, we also validate the feasibility of our approach in real-world experiments.
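
To make the uplink and downlink mechanisms more concrete, the sketch below gives one plausible reading of the Uncertainty-guided Token Sampling (UTS) and Dynamic Weight update Compression (DWC) steps described in the abstract. It assumes predictive entropy as the uncertainty measure and magnitude-based top-k selection with int8 quantization for the weight updates; the function names and parameters such as `keep_ratio` and `update_ratio` are illustrative and are not taken from the paper.

```python
# Minimal sketch of two components of the framework, under assumptions stated above.
import torch
import torch.nn.functional as F


def uncertainty_guided_token_sampling(logits: torch.Tensor, keep_ratio: float = 0.25):
    """Choose a budget of tokens to transmit uplink based on predictive entropy.

    logits: (num_tokens, vocab_size) outputs of the device-side model for one sample.
    Returns indices of the tokens selected for cloud-side adaptation.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (num_tokens,)
    budget = max(1, int(keep_ratio * logits.shape[0]))
    # Assumption: the highest-entropy tokens are treated as out-of-distribution and
    # prioritized for the uplink; the paper's exact selection rule may differ.
    return torch.topk(entropy, budget).indices


def dynamic_weight_update_compression(old_state, new_state, update_ratio: float = 0.1):
    """Keep only the largest weight deltas and quantize them to int8 for the downlink."""
    compressed = {}
    for name, w_new in new_state.items():
        delta = (w_new - old_state[name]).flatten()
        k = max(1, int(update_ratio * delta.numel()))
        _, top_idx = torch.topk(delta.abs(), k)          # largest updates by magnitude
        selected = delta[top_idx]
        scale = selected.abs().max().clamp_min(1e-12) / 127.0
        q = torch.clamp((selected / scale).round(), -127, 127).to(torch.int8)
        compressed[name] = {
            "indices": top_idx, "q_values": q, "scale": scale, "shape": w_new.shape,
        }
    return compressed
```

On the device side, the compressed deltas would be dequantized (`q_values * scale`) and scattered back into the corresponding weight tensors before applying the update; this decoding step is likewise an assumption about the downlink, not the paper's specification.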
