Bidirectional Contrastive Split Learning for Visual Question Answering (2208.11435v4)
Abstract: Visual Question Answering (VQA) based on multi-modal data facilitates real-life applications such as home robots and medical diagnosis. One significant challenge is to devise a robust decentralized learning framework for heterogeneous client models when centralized data collection is ruled out by confidentiality concerns. This work tackles privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module, leveraging inter-module gradient sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL), which trains a global multi-modal model on the entire data distribution of decentralized clients. We employ a contrastive loss that enables more efficient self-supervised learning of the decentralized modules. Comprehensive experiments on the VQA-v2 dataset with five SOTA VQA models demonstrate the effectiveness of the proposed method. Furthermore, we inspect BiCSL's robustness against a dual-key backdoor attack on VQA. BiCSL shows much better robustness to this multi-modal adversarial attack than the centralized learning method, providing a promising approach to decentralized multi-modal learning.
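The contrastive module described above aligns paired image and question representations across the split modules. The abstract does not spell out the exact objective, so the sketch below assumes a symmetric InfoNCE-style loss, as is common in self-supervised multi-modal learning; function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def info_nce_loss(img_feats, txt_feats, temperature=0.1):
    """Symmetric InfoNCE loss over paired image/question embeddings.

    Row i of each matrix is a matching (positive) pair; every other
    row in the batch serves as an in-batch negative.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (matching pair) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs push it toward log(batch size), which is what lets decentralized modules learn from gradient sharing without exchanging raw data.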