On the Difference of BERT-style and CLIP-style Text Encoders (2306.03678v1)

Published 6 Jun 2023 in cs.CL

Abstract: Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, e.g., BERT, one of the representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models that achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to studying the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders from three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for cross-modal association, which is more similar to the senses of humans.
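For context on the CLIP-style pretraining the abstract contrasts with MLM: CLIP trains its text encoder with a symmetric contrastive (InfoNCE) objective over matched image-text pairs. The sketch below is illustrative only, assuming a generic batch of embeddings; the function name and temperature value are not from the paper.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss, as used in CLIP-style contrastive pretraining.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, scaled by a temperature hyperparameter.
    logits = image_emb @ text_emb.T / temperature

    # Cross-entropy in both directions (image->text and text->image),
    # with the matched pair on the diagonal as the positive class.
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each caption's embedding toward its paired image and away from the other images in the batch, which is the training signal behind the "synesthesia" ability the paper attributes to CLIP-style text encoders.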

Authors (5)
  1. Zhihong Chen (63 papers)
  2. Guiming Hardy Chen (8 papers)
  3. Shizhe Diao (47 papers)
  4. Xiang Wan (93 papers)
  5. Benyou Wang (109 papers)
Citations (15)