Lightweight Cross-Modal Representation Learning (2403.04650v3)

Published 7 Mar 2024 in cs.LG and cs.AI

Abstract: Low-cost cross-modal representation learning is crucial for deriving semantic representations across diverse modalities such as text, audio, images, and video. Traditional approaches typically depend on large specialized models trained from scratch, requiring extensive datasets and resulting in high resource and time costs. To overcome these challenges, we introduce a novel approach named Lightweight Cross-Modal Representation Learning (LightCRL). This method uses a single neural network, the Deep Fusion Encoder (DFE), which projects data from multiple modalities into a shared latent representation space. This reduces the overall parameter count while still delivering robust performance comparable to more complex systems.
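
A minimal PyTorch sketch of the idea described in the abstract: one shared encoder, standing in for the Deep Fusion Encoder (DFE), maps features from several modalities into a common latent space via thin per-modality adapters. The adapter and trunk layer sizes, the unit normalization, and the cross-modal similarity computation at the end are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepFusionEncoder(nn.Module):
    """Hypothetical sketch: a single shared encoder reused across modalities."""

    def __init__(self, modality_dims, latent_dim=256, hidden_dim=512):
        super().__init__()
        # Thin per-modality adapters bring raw features to a common width.
        self.adapters = nn.ModuleDict(
            {name: nn.Linear(dim, hidden_dim) for name, dim in modality_dims.items()}
        )
        # One shared trunk serves every modality, which is what keeps the
        # parameter count low compared to one full encoder per modality.
        self.shared = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, x, modality):
        z = self.shared(self.adapters[modality](x))
        return F.normalize(z, dim=-1)  # unit-norm vectors, so dot product = cosine


# Usage: embed pre-extracted text (e.g. 768-d) and image (e.g. 2048-d) features.
dfe = DeepFusionEncoder({"text": 768, "image": 2048})
text_z = dfe(torch.randn(8, 768), "text")
image_z = dfe(torch.randn(8, 2048), "image")
similarity = text_z @ image_z.T  # 8 x 8 cross-modal similarity matrix
```

Because the trunk is shared, adding a new modality only costs one additional adapter layer rather than a full encoder, which is consistent with the parameter savings the abstract claims.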

Authors (4)
  1. Bilal Faye (10 papers)
  2. Hanane Azzag (18 papers)
  3. Mustapha Lebbah (30 papers)
  4. Djamel Bouchaffra (6 papers)