
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning (2307.03132v2)

Published 6 Jul 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
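
As a rough illustration of the filtering recipe described in the abstract, the sketch below masks detected text regions in an image and then re-scores the masked image against its caption with an off-the-shelf OpenCLIP model, keeping the pair only if the masked score stays above a threshold. The text-detection step (`text_boxes` is taken as given), the gray-fill masking, and the 0.28 threshold are illustrative assumptions, not the paper's exact choices; see the linked repository for the authors' implementation.

```python
# Minimal sketch of the T-MARS idea: mask text, re-score with CLIP, filter.
# Assumes text bounding boxes come from some off-the-shelf text detector
# (the paper uses a dedicated detector; here they are a plain input).
import torch
import open_clip
from PIL import Image, ImageDraw

# Any CLIP checkpoint works for scoring; ViT-B/32 is an illustrative choice.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def mask_text_regions(image: Image.Image, text_boxes) -> Image.Image:
    """Paint over detected text boxes so OCR-able content no longer helps CLIP."""
    masked = image.convert("RGB").copy()
    draw = ImageDraw.Draw(masked)
    for (x0, y0, x1, y1) in text_boxes:
        # Constant gray fill for simplicity; the paper's masking choice may differ.
        draw.rectangle([x0, y0, x1, y1], fill=(128, 128, 128))
    return masked


@torch.no_grad()
def masked_clip_score(image: Image.Image, caption: str, text_boxes) -> float:
    """Cosine similarity between the caption and the text-masked image."""
    masked = mask_text_regions(image, text_boxes)
    img = preprocess(masked).unsqueeze(0)
    txt = tokenizer([caption])
    img_feat = model.encode_image(img)
    txt_feat = model.encode_text(txt)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()


def keep_pair(image, caption, text_boxes, threshold: float = 0.28) -> bool:
    """Retain the pair only if the visual features left after masking
    still match the caption. The threshold value is illustrative."""
    return masked_clip_score(image, caption, text_boxes) >= threshold
```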

Authors (5)
  1. Pratyush Maini (19 papers)
  2. Sachin Goyal (17 papers)
  3. J. Zico Kolter (151 papers)
  4. Aditi Raghunathan (56 papers)
  5. Zachary C. Lipton (137 papers)
Citations (28)
