XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning (2211.13929v5)

Published 25 Nov 2022 in cs.CV

Abstract: We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle the domain discrepancy between audio and visual modalities, enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by $8\%$ to $14\%$ on UCF101, HMDB51, and Kinetics400. XKD also improves multimodal action classification by $5.5\%$ on Kinetics-Sound, and shows state-of-the-art performance in sound classification on ESC50, achieving a top-1 accuracy of $96.5\%$.
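The abstract describes two pseudo objectives (per-modality masked reconstruction and teacher-student cross-modal distillation) plus a domain alignment step. The PyTorch sketch below shows one plausible way to wire such a training step together; it is not the paper's implementation. The EMA teacher, the MMD-style alignment penalty, and all module and function names here (ModalityEncoder, xkd_step, etc.) are assumptions made for illustration, and the actual backbones, masking ratios, and loss formulations may differ.

```python
# Minimal sketch of the two pseudo objectives described above, assuming an
# EMA (momentum) teacher and an MMD-style alignment penalty; names, losses,
# and hyperparameters are illustrative, not the paper's exact recipe.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Stand-in transformer encoder for one stream (audio or video)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)  # lightweight reconstruction head

    def forward(self, tokens):
        z = self.encoder(self.proj(tokens))
        return z, self.decoder(z)


def masked_reconstruction(encoder, tokens, mask_ratio=0.75):
    """Objective 1: MAE-style reconstruction of randomly masked tokens."""
    mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
    z, recon = encoder(tokens.masked_fill(mask.unsqueeze(-1), 0.0))
    return z, F.mse_loss(recon[mask], tokens[mask])


def mmd(x, y):
    """Linear-kernel MMD between batch means -- an assumed stand-in for the
    paper's domain alignment strategy."""
    return (x.mean(0) - y.mean(0)).pow(2).sum()


@torch.no_grad()
def ema_update(teacher, student, m=0.999):
    """Momentum (EMA) teacher update, as in common self-distillation setups."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)


def xkd_step(v_student, a_student, v_teacher, a_teacher, v_tokens, a_tokens):
    # Objective 1: modality-specific masked reconstruction.
    zv, rec_v = masked_reconstruction(v_student, v_tokens)
    za, rec_a = masked_reconstruction(a_student, a_tokens)

    # Objective 2: cross-modal distillation -- each student regresses the
    # other modality's (no-grad) teacher representation.
    with torch.no_grad():
        tv, _ = v_teacher(v_tokens)  # video teacher -> audio student
        ta, _ = a_teacher(a_tokens)  # audio teacher -> video student
    kd = F.mse_loss(za.mean(1), tv.mean(1)) + F.mse_loss(zv.mean(1), ta.mean(1))

    # Domain alignment to shrink the audio-visual feature gap (assumed MMD).
    align = mmd(zv.mean(1), za.mean(1))
    return rec_v + rec_a + kd + align


# Usage with dummy token sequences of shape (batch, tokens, dim).
v_enc, a_enc = ModalityEncoder(), ModalityEncoder()
v_tea, a_tea = copy.deepcopy(v_enc), copy.deepcopy(a_enc)
loss = xkd_step(v_enc, a_enc, v_tea, a_tea,
                torch.randn(2, 16, 256), torch.randn(2, 8, 256))
loss.backward()
ema_update(v_tea, v_enc)
ema_update(a_tea, a_enc)
```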

Authors (2)
  1. Pritam Sarkar
  2. Ali Etemad
Citations (16)
