T-VSE: Transformer-Based Visual Semantic Embedding (2005.08399v1)
Abstract: Transformer models have recently achieved impressive performance on NLP tasks, owing to new algorithms for self-supervised pre-training on very large text corpora. In contrast, recent literature suggests that simple average word models outperform more complex language models, e.g., RNNs and transformers, on cross-modal image/text search tasks on standard benchmarks, like MS COCO. In this paper, we show that dataset scale and training strategy are critical and demonstrate that transformer-based cross-modal embeddings outperform word average and RNN-based embeddings by a large margin, when trained on a large dataset of e-commerce product image-title pairs.
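To make the setup concrete, below is a minimal sketch of a transformer-based dual encoder for product image-title matching in the spirit of the abstract. The encoder sizes, token vocabulary, ResNet backbone, and the max-of-hinges triplet ranking loss over in-batch negatives are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical dual-encoder sketch: transformer text branch, CNN image branch,
# shared L2-normalized embedding space, VSE-style triplet ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=512, heads=8, layers=4, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids):                      # (B, L) int64 token ids
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(pos)[None]
        x = self.encoder(x)
        return F.normalize(x.mean(dim=1), dim=-1)      # (B, D), unit norm


class ImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        backbone = torchvision.models.resnet18()       # assumed backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.backbone = backbone

    def forward(self, images):                         # (B, 3, H, W)
        return F.normalize(self.backbone(images), dim=-1)


def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet loss over in-batch negatives (assumed objective)."""
    sim = img_emb @ txt_emb.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                      # similarities of matching pairs
    cost_txt = (margin + sim - pos).clamp(min=0)       # image anchor vs. wrong title
    cost_img = (margin + sim - pos.t()).clamp(min=0)   # title anchor vs. wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # Max-violation variant: penalize only the hardest in-batch negative per anchor.
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()


# Usage: one training step on a random batch of image-title pairs.
txt_enc, img_enc = TextEncoder(), ImageEncoder()
titles = torch.randint(0, 30000, (8, 32))
images = torch.randn(8, 3, 224, 224)
loss = triplet_ranking_loss(img_enc(images), txt_enc(titles))
loss.backward()
```

Swapping the `TextEncoder` above for an average-of-word-embeddings or RNN encoder yields the baselines the abstract compares against; the cross-modal retrieval loss and the image branch stay the same.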