SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining (2404.01156v1)

Published 1 Apr 2024 in cs.CV and cs.AI

Abstract: Vision-Language Models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in the fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. Addressing this problem, we propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false-negative issues in the Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, which outperforms existing methods on three downstream tasks.
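The masking mechanism described in the abstract lends itself to a short illustration. Below is a minimal sketch, assuming the momentum model exposes head-averaged text-to-image cross-attention weights; the tensor shapes, mask ratios, and the scoring rule (max attention per token and per patch) are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch of synchronized attentional masking. Assumes a (B, T, P)
# cross-attention tensor from the momentum model: T text tokens attending
# to P image patches, averaged over attention heads. All names, shapes,
# and ratios here are illustrative assumptions.
import torch

def sync_masks(cross_attn, mask_ratio_txt=0.15, mask_ratio_img=0.5):
    """Select word tokens and image patches to mask where information
    co-occurs in both modalities. Returns boolean masks (B, T) and (B, P)."""
    B, T, P = cross_attn.shape

    # A token attending strongly to some patch is visually grounded;
    # score tokens by their maximum attention over patches.
    token_scores = cross_attn.max(dim=-1).values            # (B, T)
    # A patch receiving strong attention from some token is described
    # in the text; score patches by their maximum attention over tokens.
    patch_scores = cross_attn.max(dim=1).values             # (B, P)

    # Sample positions without replacement, biased toward high
    # co-occurrence scores, so masked content is recoverable from
    # the other modality.
    n_txt = max(1, int(mask_ratio_txt * T))
    n_img = max(1, int(mask_ratio_img * P))
    txt_idx = torch.multinomial(token_scores.softmax(-1), n_txt)  # (B, n_txt)
    img_idx = torch.multinomial(patch_scores.softmax(-1), n_img)  # (B, n_img)

    txt_mask = torch.zeros(B, T, dtype=torch.bool).scatter_(1, txt_idx, True)
    img_mask = torch.zeros(B, P, dtype=torch.bool).scatter_(1, img_idx, True)
    return txt_mask, img_mask
```

The intent is that, unlike random masking, positions are chosen where the two modalities actually overlap, so the MLM and MIM targets remain predictable from the unmasked modality.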
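The semi-hard negative selection can be sketched in the same spirit. The (B, B) similarity matrix, the use of product IDs to screen out same-item texts, and the skip_top rule for stepping past the hardest (likely false) negatives are all assumptions made for illustration.

```python
# Minimal sketch of semi-hard negative selection for ITM within a grouped
# batch. In fashion data the hardest in-batch negatives are often other
# views of the same item, i.e. false negatives, so the top candidates
# are skipped. The exclusion rule and threshold are assumptions.
import torch

def semi_hard_negatives(sim, item_ids, skip_top=1):
    """Pick one semi-hard negative text index per image.

    sim:      (B, B) similarity; sim[i, j] scores image i against text j.
    item_ids: (B,) product IDs; texts sharing an image's ID are treated
              as false negatives and excluded.
    """
    same_item = item_ids.unsqueeze(0) == item_ids.unsqueeze(1)   # (B, B)
    masked = sim.masked_fill(same_item, float("-inf"))  # drop positives/dupes

    # Rank remaining candidates from hardest (most similar) to easiest,
    # then step past the top few to land on semi-hard negatives.
    order = masked.argsort(dim=1, descending=True)               # (B, B)
    return order[:, skip_top]                                    # (B,)
```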
