
$f$-MICL: Understanding and Generalizing InfoNCE-based Contrastive Learning (2402.10150v1)

Published 15 Feb 2024 in cs.LG

Abstract: In self-supervised contrastive learning, a widely-adopted objective function is InfoNCE, which uses the heuristic cosine similarity for the representation comparison, and is closely related to maximizing the Kullback-Leibler (KL)-based mutual information. In this paper, we aim at answering two intriguing questions: (1) Can we go beyond the KL-based objective? (2) Besides the popular cosine similarity, can we design a better similarity function? We provide answers to both questions by generalizing the KL-based mutual information to the $f$-Mutual Information in Contrastive Learning ($f$-MICL) using the $f$-divergences. To answer the first question, we provide a wide range of $f$-MICL objectives which share the nice properties of InfoNCE (e.g., alignment and uniformity), and meanwhile result in similar or even superior performance. For the second question, assuming that the joint feature distribution is proportional to the Gaussian kernel, we derive an $f$-Gaussian similarity with better interpretability and empirical performance. Finally, we identify close relationships between the $f$-MICL objective and several popular InfoNCE-based objectives. Using benchmark tasks from both vision and natural language, we empirically evaluate $f$-MICL with different $f$-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that $f$-MICL generally outperforms the benchmarks and the best-performing $f$-divergence is task and dataset dependent.
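The abstract describes two ingredients that can be made concrete: an $f$-divergence-based contrastive objective, which can be sketched via the standard variational lower bound on $f$-divergences (as in the NWJ/f-GAN formulation), and a Gaussian-kernel similarity motivated by the assumption that the joint feature distribution is proportional to a Gaussian kernel. The PyTorch sketch below is illustrative only and assumes these standard forms; the function names, the particular convex conjugates, and the exact similarity are our own choices and may differ from the paper's precise $f$-MICL and $f$-Gaussian definitions.

```python
import torch

# Convex conjugates f*(u) for two common f-divergences (standard NWJ/f-GAN forms);
# the paper's exact f-MICL parameterization may differ.
F_STAR = {
    "kl":   lambda u: torch.exp(u - 1.0),   # f(t) = t log t
    "chi2": lambda u: 0.25 * u ** 2 + u,    # Pearson chi-squared, f(t) = (t - 1)^2
}

def gaussian_similarity(z1, z2, sigma=1.0):
    """Illustrative Gaussian-kernel similarity: -||z1 - z2||^2 / (2 sigma^2)."""
    return -((z1 - z2) ** 2).sum(dim=-1) / (2.0 * sigma ** 2)

def f_contrastive_loss(z1, z2, f_name="kl", sigma=1.0):
    """Variational f-divergence contrastive loss (sketch, not the paper's code).

    z1, z2: (N, d) feature batches where (z1[i], z2[i]) form positive pairs.
    Diagonal pairs approximate samples from the joint distribution;
    off-diagonal pairs approximate the product of marginals.
    """
    f_star = F_STAR[f_name]
    n = z1.shape[0]
    # Pairwise similarities T(x, y) for all (i, j): shape (N, N).
    sim = gaussian_similarity(z1.unsqueeze(1), z2.unsqueeze(0), sigma)
    pos = sim.diagonal()                                            # joint samples
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = sim[mask]                                                 # marginal samples
    # Lower bound: E_joint[T] - E_marginals[f*(T)]; minimize its negation.
    return -(pos.mean() - f_star(neg).mean())
```

As a usage note, swapping the `f_name` key changes only the conjugate applied to the negative-pair similarities, which mirrors the abstract's observation that the best-performing $f$-divergence is task and dataset dependent.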
