
Investigating the Benefits of Projection Head for Representation Learning (2403.11391v1)

Published 18 Mar 2024 in cs.LG and cs.CV

Abstract: An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.
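To make the technique the abstract analyzes concrete, here is a minimal sketch (not the authors' code) of the standard SimCLR-style protocol: a projection head is placed on top of the encoder during contrastive pre-training, the loss is computed on the post-projection features z, and afterwards the head is discarded so that downstream tasks use the pre-projection representations h. The `ContrastiveModel` class, `info_nce_loss` function, and all dimensions below are illustrative assumptions.

```python
# Minimal sketch of contrastive pre-training with a discardable projection head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveModel(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, proj_dim: int = 128):
        super().__init__()
        self.encoder = encoder                    # backbone, e.g. a ResNet trunk
        self.projection_head = nn.Sequential(     # discarded after pre-training
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)                       # pre-projection representation (kept)
        z = self.projection_head(h)               # post-projection feature (fed to the loss)
        return h, z

def info_nce_loss(z1, z2, temperature: float = 0.5):
    """InfoNCE / NT-Xent loss over two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                # (2N, d)
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))         # exclude self-similarity
    # For sample i, the positive is its other augmented view at index i +/- n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# After pre-training, only model.encoder is retained; a linear probe or classifier
# is then trained on h, the pre-projection representations the paper studies.
```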

Authors (5)
  1. Yihao Xue (10 papers)
  2. Eric Gan (6 papers)
  3. Jiayi Ni (5 papers)
  4. Siddharth Joshi (28 papers)
  5. Baharan Mirzasoleiman (51 papers)
Citations (6)

