XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning (2211.13929v5)
Abstract: We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from the audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle the domain discrepancy between the audio and visual modalities, enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by $8\%$ to $14\%$ on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by $5.5\%$ on Kinetics-Sound. XKD achieves state-of-the-art performance in sound classification on ESC50, with a top-1 accuracy of $96.5\%$.
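The abstract's second objective, cross-modal distillation with domain alignment, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's implementation: it regresses student features onto a cross-modal teacher's features and adds a linear-kernel MMD term as a stand-in for the domain alignment strategy (the function names `mmd_linear` and `distillation_step` and the weighting `lam` are invented here for illustration).

```python
import numpy as np

def mmd_linear(x, y):
    """Linear-kernel Maximum Mean Discrepancy between two feature batches.
    A common distribution-alignment penalty; the paper's exact alignment
    strategy may differ."""
    delta = x.mean(axis=0) - y.mean(axis=0)  # difference of batch means
    return float(delta @ delta)              # squared norm of the gap

def distillation_step(student_feats, teacher_feats, cross_modal_feats, lam=0.1):
    """One illustrative loss: match student features to the (frozen)
    cross-modal teacher's, plus a hypothetical domain-alignment penalty."""
    kd = float(((student_feats - teacher_feats) ** 2).mean())  # feature regression
    align = mmd_linear(student_feats, cross_modal_feats)       # domain alignment
    return kd + lam * align

# Illustrative usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
video_feats = rng.standard_normal((8, 16))   # student (e.g. visual branch)
audio_feats = rng.standard_normal((8, 16))   # teacher (e.g. audio branch)
loss = distillation_step(video_feats, audio_feats, audio_feats)
```

In the actual framework the teacher is the other modality's network (updated without gradients, e.g. via EMA as in mean-teacher setups), and distillation runs in both directions; this sketch only shows the shape of one loss evaluation.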
- Pritam Sarkar
- Ali Etemad