
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (2102.05918v2)

Published 11 Feb 2021 in cs.CV, cs.CL, and cs.LG

Abstract: Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

In the paper titled “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision,” authors Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig present a novel approach designed to mitigate cost and complexity in the data curation process for large-scale visual and vision-language pre-training datasets. Their method leverages a noisy but extensive dataset of over one billion image alt-text pairs, bypassing the traditional need for costly data annotation and cleaning procedures.

Key Contributions

The authors introduce ALIGN (A Large-scale ImaGe and Noisy-text embedding), a dual-encoder model that aligns visual and language representations in a shared embedding space using a contrastive loss. Despite its simple architecture, ALIGN achieves state-of-the-art (SOTA) performance on numerous benchmarks, establishing its robustness and effectiveness across visual and vision-language tasks.

Methodology

The ALIGN framework builds on a large-scale, noisy dataset sourced directly from web alt-text, maintaining high data volume while drastically reducing preprocessing effort. The data is cleaned only with simple frequency-based filtering, trading precision for scale. The dual-encoder architecture pairs an EfficientNet-based image encoder with a BERT-based text encoder, trained jointly with a normalized softmax (contrastive) loss that pulls matched image-text pairs together and pushes mismatched pairs apart in the shared embedding space.
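Concretely, the objective can be viewed as a symmetric normalized softmax over in-batch image-text pairs. The sketch below is a minimal illustration of that loss rather than the authors' implementation; the encoder calls in the usage comments and the fixed default temperature are assumptions (in ALIGN the temperature is learned jointly with the encoders).

```python
# Minimal sketch of a dual-encoder contrastive (normalized softmax) loss,
# in the spirit of ALIGN. Illustrative only; not the authors' code.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image loss over in-batch negatives.

    ALIGN learns the temperature during training; a fixed value is used
    here purely for illustration.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature   # [batch, batch]

    # The i-th image matches the i-th text; all other pairs are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical usage with stand-in encoders (e.g. an EfficientNet image tower
# and a BERT text tower, each projected to the same embedding dimension):
# img_emb = image_encoder(images)       # [batch, dim]
# txt_emb = text_encoder(alt_texts)     # [batch, dim]
# loss = contrastive_loss(img_emb, txt_emb)
```

In the paper this loss is computed over very large effective batches, so every example is contrasted against many in-batch negatives.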

Numerical Results and Evaluation

The authors conduct thorough evaluations:

  • Cross-Modal Retrieval: On benchmarks such as Flickr30K and MSCOCO, ALIGN sets new SOTA results, outperforming more sophisticated cross-attention models such as UNITER as well as the dual-encoder CLIP. For instance, in the zero-shot setting ALIGN improves text-to-image recall@1 on Flickr30K by roughly 7 percentage points over CLIP.
  • Visual Classification: Transferred zero-shot to classification tasks, ALIGN matches or exceeds existing models. Evaluations on ImageNet and its variants show strong robustness to distribution shift on ImageNet-R and ImageNet-A (a minimal sketch of the zero-shot procedure follows this list).
  • Efficiency: The model reaches 85.5% top-1 accuracy on ImageNet with frozen features and 88.64% when fully fine-tuned, competitive with models trained on carefully curated datasets while relying on a far cheaper data-collection pipeline.
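Because both modalities share one embedding space, zero-shot classification and retrieval reduce to nearest-neighbor search over cosine similarities. The snippet below is an illustrative sketch under that view; the prompt template and the encoder functions in the usage comments are hypothetical placeholders, not the released ALIGN API.

```python
# Illustrative zero-shot classification with a trained dual encoder:
# embed class names as text prompts, then assign each image to the class
# whose prompt embedding is most similar. Sketch only; not ALIGN's code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor,
                       class_text_emb: torch.Tensor) -> torch.Tensor:
    """image_emb: [batch, dim]; class_text_emb: [num_classes, dim]."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    similarities = image_emb @ class_text_emb.t()   # cosine similarities
    return similarities.argmax(dim=-1)              # predicted class per image

# Hypothetical usage:
# prompts = [f"a photo of a {name}" for name in class_names]
# class_text_emb = text_encoder(prompts)   # [num_classes, dim]
# image_emb = image_encoder(images)        # [batch, dim]
# predictions = zero_shot_classify(image_emb, class_text_emb)
```

Ranking the rows (or columns) of the same similarity matrix, rather than taking the argmax, yields image-to-text (or text-to-image) retrieval, which is essentially how the Flickr30K and MSCOCO benchmarks above are evaluated.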

Implications and Future Directions

The practical implications of this work are broad. By reducing reliance on elaborately annotated datasets, it paves the way for cost-effective scaling of visual and vision-language models. ALIGN's applicability to both cross-modal and purely visual tasks underscores its potential for applications ranging from image retrieval systems to robust visual classification across varied deployment scenarios.

Theoretically, the results indicate that sheer data volume, paired with a robust training objective, can compensate for the noise that limits uncurated datasets. This finding could encourage a broader shift toward learning representations from uncurated web data.

Conclusion

ALIGN exemplifies how large-scale web data, despite its noisiness, can be harnessed to achieve leading performance on visual and vision-language tasks, challenging the conventional dependence on curated datasets and complex cross-modal architectures. Future research might incorporate more sophisticated noise-handling strategies or further explore the compositional and multilingual capabilities of the learned embeddings, building on the promising results of ALIGN's multilingual variant.

This research offers a substantial contribution to both practical AI deployment and theoretical understanding, showing that scalable, high-performance models can indeed arise from the noise.
