TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality (2302.02224v3)
Abstract: This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). This expression lays the foundation for introducing The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.
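The link the abstract draws between Nadaraya-Watson (NW) kernel regression and cross-attention can be illustrated with a small sketch. This is not the paper's TAP module: it omits the learned linear transformations and any training, and the choice of a Gaussian kernel, the `bandwidth` parameter, the function names, and the toy data are all assumptions made for illustration. It only shows that an NW estimate with a Gaussian kernel gives the same output as softmax attention whose scores are negative scaled squared distances between a query and a set of keys.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nw_regression(query, keys, values, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted average of the values."""
    # Gaussian kernel weight between the query and every key (assumed kernel).
    weights = [math.exp(-squared_distance(query, k) / (2.0 * bandwidth ** 2))
               for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def kernelized_cross_attention(query, keys, values, bandwidth=1.0):
    """Softmax attention with scores -||q - k||^2 / (2 * bandwidth^2)."""
    scores = [-squared_distance(query, k) / (2.0 * bandwidth ** 2) for k in keys]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(values[0])
    return [sum(a * v[j] for a, v in zip(attn, values)) for j in range(dim)]

# Hypothetical toy data: one query (standing in for a primary-modality feature)
# and three key/value pairs (standing in for secondary-modality references).
query = [0.2, -0.1]
keys = [[0.0, 0.0], [1.0, 0.5], [-0.5, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

nw_out = nw_regression(query, keys, values, bandwidth=0.7)
attn_out = kernelized_cross_attention(query, keys, values, bandwidth=0.7)
```

The two functions differ only in how the normalized weights are computed, so `nw_out` and `attn_out` agree up to floating-point error. In a trainable variant, the query and keys would presumably first pass through learned linear maps, which is the "under linear transformation" condition the abstract mentions.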