LLMBind: A Unified Modality-Task Integration Framework (2402.14891v5)
Abstract: In the multi-modal domain, the dependence of different models on specific input formats confuses users and hinders progress. To address this challenge, we introduce LLMBind, a novel framework designed to unify a diverse array of multi-modal tasks. By harnessing a Mixture-of-Experts (MoE) LLM, LLMBind processes multi-modal inputs and generates task-specific tokens that invoke the corresponding models to accomplish each task. This approach enables LLMBind to interpret inputs and generate outputs across modalities including image, text, video, and audio. We have also constructed an interaction dataset of 400k instructions, which unlocks LLMBind's ability to perform interactive visual generation and editing tasks. Extensive experiments demonstrate that LLMBind achieves superior performance across diverse tasks and outperforms existing models in user evaluations conducted in real-world scenarios. Moreover, LLMBind's adaptability allows seamless integration with the latest models and extension to new modality tasks, highlighting its potential as a unified AI agent for modeling universal modalities.
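The abstract's core mechanism, an LLM emitting task-specific tokens that route a request to a dedicated expert model, can be sketched as a simple dispatcher. This is a minimal illustration, not the paper's implementation: the token names (`GEN_IMG`, `GEN_VID`, `EDIT_IMG`), the tag syntax, and the stub handlers are all hypothetical assumptions, since the paper's actual token vocabulary and invocation interface are not specified here.

```python
import re

# Hypothetical task tokens mapped to stand-in expert models.
# In LLMBind, each handler would call a real generator
# (e.g. a text-to-image or text-to-video model).
TASK_HANDLERS = {
    "GEN_IMG": lambda prompt: f"[image generated from: {prompt}]",
    "GEN_VID": lambda prompt: f"[video generated from: {prompt}]",
    "EDIT_IMG": lambda prompt: f"[image edited per: {prompt}]",
}

# Match tag pairs like <GEN_IMG>...</GEN_IMG> in the LLM's output;
# (?P=task) is a backreference forcing the closing tag to match.
TOKEN_PATTERN = re.compile(r"<(?P<task>[A-Z_]+)>(?P<prompt>.*?)</(?P=task)>", re.S)

def dispatch(llm_output: str) -> list[str]:
    """Scan generated text for task tokens and invoke the
    corresponding (stubbed) expert model for each one."""
    results = []
    for match in TOKEN_PATTERN.finditer(llm_output):
        task = match.group("task")
        prompt = match.group("prompt").strip()
        handler = TASK_HANDLERS.get(task)
        if handler is not None:
            results.append(handler(prompt))
    return results

print(dispatch("Sure! <GEN_IMG>a cat surfing a wave</GEN_IMG>"))
# → ['[image generated from: a cat surfing a wave]']
```

The key design point the abstract highlights is that the LLM itself stays modality-agnostic: swapping in a newer image or video model only means replacing one handler, which is what makes the framework extensible to new modality tasks.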