Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
96 tokens/sec
Gemini 2.5 Pro Premium
42 tokens/sec
GPT-5 Medium
20 tokens/sec
GPT-5 High Premium
27 tokens/sec
GPT-4o
100 tokens/sec
DeepSeek R1 via Azure Premium
86 tokens/sec
GPT OSS 120B via Groq Premium
464 tokens/sec
Kimi K2 via Groq Premium
181 tokens/sec
2000 character limit reached

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning (2405.16401v2)

Published 26 May 2024 in cs.CV and cs.LG

Abstract: Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.