Mini-Batch Optimization of Contrastive Loss (2307.05906v1)
Abstract: Contrastive learning has gained significant attention as a method for self-supervised learning. The contrastive loss function ensures that embeddings of positive sample pairs (e.g., different samples from the same class or different views of the same object) are similar, while embeddings of negative pairs are dissimilar. Practical constraints such as large memory requirements make it challenging to consider all possible positive and negative pairs, leading to the use of mini-batch optimization. In this paper, we investigate the theoretical aspects of mini-batch optimization in contrastive learning. We show that mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ possible mini-batches are selected, whereas optimizing over only a subset of mini-batches can be sub-optimal. We then demonstrate that utilizing high-loss mini-batches can speed up SGD convergence and propose a spectral clustering-based approach for identifying these high-loss mini-batches. Our experimental results validate our theoretical findings and demonstrate that our proposed algorithm outperforms vanilla SGD in practically relevant settings, providing a better understanding of mini-batch optimization in contrastive learning.
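To make the setup concrete, below is a minimal NumPy/scikit-learn sketch, not the paper's exact algorithm: a symmetric InfoNCE-style contrastive loss evaluated on a mini-batch of positive pairs, together with a spectral-clustering heuristic, in the spirit of the proposed approach, that groups mutually similar (and hence hard-to-separate, high-loss) examples into the same mini-batch. The function names `info_nce_loss` and `select_high_loss_batches`, the affinity construction, and the toy data are illustrative assumptions rather than code from the paper.

```python
# Sketch only: standard symmetric InfoNCE loss plus a spectral-clustering
# heuristic for forming high-loss mini-batches. Not the authors' implementation.
import numpy as np
from sklearn.cluster import SpectralClustering


def info_nce_loss(u, v, temperature=0.1):
    """Symmetric contrastive (InfoNCE) loss for a batch of positive pairs.

    u, v: (B, d) arrays of L2-normalized embeddings; row i of u and row i of v
    form a positive pair, and all other rows in the batch act as negatives.
    """
    logits = u @ v.T / temperature  # (B, B) pairwise similarity logits
    idx = np.arange(len(u))
    # Row-wise log-softmax in both directions (u -> v and v -> u).
    log_p_uv = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    log_p_vu = logits.T - np.logaddexp.reduce(logits.T, axis=1, keepdims=True)
    return -0.5 * (log_p_uv[idx, idx].mean() + log_p_vu[idx, idx].mean())


def select_high_loss_batches(u, v, num_batches, temperature=0.1):
    """Heuristic batch selection (an assumption, not the paper's exact method):
    spectrally cluster examples by pairwise similarity so that examples that are
    hard to tell apart, and thus contribute large loss, share a mini-batch."""
    sim = u @ v.T
    affinity = np.exp((sim + sim.T) / (2.0 * temperature))  # symmetric, non-negative
    labels = SpectralClustering(
        n_clusters=num_batches, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return [np.where(labels == k)[0] for k in range(num_batches)]


# Toy usage: N = 8 positive pairs in d = 4 dimensions, split into 2 mini-batches.
rng = np.random.default_rng(0)
u = rng.normal(size=(8, 4))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v = u + 0.1 * rng.normal(size=(8, 4))
v /= np.linalg.norm(v, axis=1, keepdims=True)
for batch in select_high_loss_batches(u, v, num_batches=2):
    print(batch, info_nce_loss(u[batch], v[batch]))
```

Grouping examples whose embeddings are mutually similar tends to pack hard negatives into the same mini-batch, which is one plausible reading of why high-loss mini-batches can accelerate SGD convergence in this setting.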