On the Difference of BERT-style and CLIP-style Text Encoders (2306.03678v1)
Abstract: Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT as one of its representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially for its vision models, which achieve excellent performance on a broad range of vision tasks. However, few studies have examined the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders through three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they possess a unique ability for cross-modal association, i.e., synesthesia, which is closer to human perception.
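A minimal sketch of the kind of comparison the abstract describes, assuming the Hugging Face `transformers` checkpoints `bert-base-uncased` and `openai/clip-vit-base-patch32` as stand-ins for a BERT-style and a CLIP-style text encoder (the paper's exact models and probing setup are not specified here):

```python
# Extract sentence embeddings from a BERT-style and a CLIP-style text encoder
# so that downstream probes (text understanding, cross-modal association) can
# be run on both. Checkpoints are illustrative assumptions, not the paper's.
import torch
from transformers import AutoTokenizer, AutoModel, CLIPTokenizer, CLIPTextModel

sentences = ["a photo of a red apple on a wooden table"]

# BERT-style encoder: use the [CLS] token of the last hidden layer.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_inputs = bert_tok(sentences, return_tensors="pt", padding=True)
    bert_emb = bert(**bert_inputs).last_hidden_state[:, 0]  # (batch, 768)

# CLIP-style text encoder: the pooled output is the EOS-token embedding,
# i.e., the representation CLIP aligns with image features.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_inputs = clip_tok(sentences, return_tensors="pt", padding=True)
    clip_emb = clip_text(**clip_inputs).pooler_output  # (batch, 512)

print(bert_emb.shape, clip_emb.shape)
```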
- Zhihong Chen
- Guiming Hardy Chen
- Shizhe Diao
- Xiang Wan
- Benyou Wang