Rethinking Multi-view Representation Learning via Distilled Disentangling (2403.10897v2)
Abstract: Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To address this redundancy, we propose an innovative framework for multi-view representation learning that incorporates a technique we term 'distilled disentangling'. Our method introduces masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters consistency-related information out of the multi-view representations, yielding purer view-specific representations. This approach significantly reduces the redundancy between view-consistent and view-specific representations and enhances the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that keeping the dimensionality of view-consistent representations lower than that of view-specific representations further refines the quality of the combined representations. Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.
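To make the two key ideas in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of masked cross-view prediction: a large fraction of one view's features is masked, and a low-dimensional encoder must reconstruct the *other* view from what remains, which pressures the code to retain only view-consistent information. The module name, layer sizes, and masking scheme are illustrative assumptions, not the authors' exact architecture (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn

class MaskedCrossViewPredictor(nn.Module):
    """Illustrative sketch (not the paper's exact model) of masked
    cross-view prediction. A high mask ratio hides most of view A's
    features; reconstructing view B from the remainder encourages the
    bottleneck code to capture only view-consistent information."""

    def __init__(self, dim: int = 32, consistent_dim: int = 8,
                 mask_ratio: float = 0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Consistent code kept deliberately low-dimensional, mirroring the
        # paper's finding that d(consistent) < d(specific) helps.
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                     nn.Linear(64, consistent_dim))
        self.decoder = nn.Sequential(nn.Linear(consistent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, dim))

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor):
        # Randomly zero out a `mask_ratio` fraction of view A's features.
        keep = (torch.rand_like(view_a) > self.mask_ratio).float()
        z_consistent = self.encoder(view_a * keep)
        # Predict the *other* view from the masked code: the cross-view
        # reconstruction loss is what drives view-consistency.
        recon_b = self.decoder(z_consistent)
        loss = ((recon_b - view_b) ** 2).mean()
        return z_consistent, loss

if __name__ == "__main__":
    model = MaskedCrossViewPredictor()
    a, b = torch.randn(4, 32), torch.randn(4, 32)
    z, loss = model(a, b)
    print(z.shape, float(loss))
```

In the full framework, a second (disentangling) stage would additionally minimize the mutual information between this consistent code and each view-specific code, e.g. with a CLUB-style upper bound; that stage is omitted here for brevity.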