Explaining latent representations of generative models with large multimodal models (2402.01858v3)
Abstract: Learning interpretable representations of the generative latent factors of data is an important topic for the development of artificial intelligence. With the rise of large multimodal models, which can align images with text to generate answers, new tools for interpretation have become available. In this work, we propose a framework that uses a large multimodal model to comprehensively explain each latent variable in a generative model. We further measure the uncertainty of the generated explanations, quantitatively evaluate explanation quality across multiple large multimodal models, and qualitatively visualize the variation of each latent variable to study how the disentanglement properties of different generative models affect the explanations. Finally, we discuss the explanatory capabilities and limitations of state-of-the-art large multimodal models.
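The pipeline the abstract describes, varying one latent variable at a time, decoding the resulting images, and asking a large multimodal model to explain what the variable controls, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear `decode` function is a toy stand-in for a real generative model's decoder, and `build_prompt` uses hypothetical wording in place of whatever prompt and LMM (e.g. InstructBLIP or LLaVA) the authors actually use.

```python
import numpy as np

def decode(z):
    """Toy stand-in for a generative model's decoder (hypothetical):
    maps a latent vector to a 64x64 'image' via a fixed random projection."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64 * 64, z.shape[0]))
    return (W @ z).reshape(64, 64)

def traverse_latent(z, dim, values):
    """Latent traversal: vary one latent dimension while holding the
    others fixed, decoding an image at each value. The resulting image
    sequence visualizes what that single latent variable controls."""
    images = []
    for v in values:
        z_mod = z.copy()
        z_mod[dim] = v
        images.append(decode(z_mod))
    return images

def build_prompt(dim, n_images):
    """Assemble a text prompt asking a large multimodal model to explain
    the factor controlled by latent dimension `dim`. The images from the
    traversal would be attached alongside this text; the wording here is
    an assumption, not the paper's prompt."""
    return (f"These {n_images} images were generated by varying latent "
            f"variable {dim} of a generative model while fixing the others. "
            f"What visual factor does this variable control?")

# Traverse dimension 3 of a 10-dimensional latent code over [-3, 3].
z = np.zeros(10)
images = traverse_latent(z, dim=3, values=np.linspace(-3.0, 3.0, 7))
prompt = build_prompt(3, len(images))
```

Repeating this loop over every latent dimension yields one explanation per variable; comparing explanations across generative models (e.g. a beta-VAE vs. a vanilla VAE) then reflects how disentanglement affects their interpretability.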
- Mengdan Zhu
- Zhenke Liu
- Bo Pan
- Abhinav Angirekula
- Liang Zhao