Text-centric Alignment for Multi-Modality Learning (2402.08086v2)

Published 12 Feb 2024 in cs.LG, cs.CL, and cs.CV

Abstract: This research paper addresses the challenge of modality mismatch in multimodal learning, where the modalities available during inference differ from those available at training. We propose the Text-centric Alignment for Multi-Modality Learning (TAMML) approach, an innovative method that utilizes LLMs with in-context learning and foundation models to enhance the generalizability of multimodal systems under these conditions. By leveraging the unique properties of text as a unified semantic space, TAMML demonstrates significant improvements in handling unseen, diverse, and unpredictable modality combinations. TAMML not only adapts to varying modalities but also maintains robust performance, showcasing the potential of foundation models in overcoming the limitations of traditional fixed-modality frameworks in embedding representations. This study contributes to the field by offering a flexible, effective solution for real-world applications where modality availability is dynamic and uncertain.
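The abstract's central idea, converting every available modality into text so that a single text-based model can handle arbitrary modality combinations at inference, can be sketched roughly as follows. This is an illustrative sketch, not the paper's actual implementation: the converter functions are hypothetical stand-ins for the foundation models (e.g., an image captioner or a tabular-to-text serializer) that such a pipeline would call.

```python
# Sketch of text-centric alignment: map each modality to text, then join
# the pieces into one unified string a downstream LLM could consume.
# Converters here are toy stand-ins for real foundation models.

def image_to_text(image_path: str) -> str:
    # Stand-in for an image-captioning foundation model.
    return f"An image described from {image_path}."

def table_to_text(row: dict) -> str:
    # Serialize tabular features into a sentence, an LLM-friendly format.
    return "; ".join(f"{k} is {v}" for k, v in row.items())

CONVERTERS = {
    "image": image_to_text,
    "table": table_to_text,
    "text": lambda t: t,  # text passes through unchanged
}

def align_to_text(sample: dict) -> str:
    """Map whichever modalities are present into one unified text string."""
    parts = [CONVERTERS[m](v) for m, v in sample.items() if m in CONVERTERS]
    return " ".join(parts)

# The same function handles any modality combination seen at inference time:
print(align_to_text({"table": {"age": 3, "breed": "corgi"}}))
print(align_to_text({"text": "A friendly dog.", "image": "dog.jpg"}))
```

Because every modality lands in the same textual semantic space, the downstream model never needs to know which combination was present during training, which is the property the abstract credits for robustness under modality mismatch.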
