
An Empirical Study of Self-supervised Learning with Wasserstein Distance (2310.10143v2)

Published 16 Oct 2023 in stat.ML and cs.LG

Abstract: In this study, we delve into the problem of self-supervised learning (SSL) using the 1-Wasserstein distance on a tree structure (a.k.a. Tree-Wasserstein distance, TWD), where TWD is defined as the L1 distance between two tree-embedded vectors. In SSL methods, the cosine similarity is often used as the objective function, whereas the Wasserstein distance has not been well studied in this role. Because training with the Wasserstein distance is numerically challenging, this study empirically investigates strategies for optimizing SSL with the Wasserstein distance and identifies a stable training procedure. More specifically, we evaluate combinations of two types of TWD (total variation and ClusterTree) with several probability models, including the softmax function, the ArcFace probability model, and simplicial embedding. We propose a simple yet effective Jeffrey divergence-based regularization to stabilize optimization. Through empirical experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a naive combination of the softmax function and TWD yields significantly lower performance than standard SimCLR, and that a simple combination of TWD and SimSiam fails to train at all. We find that model performance depends on the combination of TWD and probability model, and that the Jeffrey divergence regularization helps model training. Finally, we show that an appropriate combination of TWD and probability model outperforms cosine similarity-based representation learning.
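For concreteness, below is a minimal NumPy sketch (not the authors' implementation) of the two quantities the abstract names: the tree-Wasserstein distance computed as a weighted L1 distance between tree-embedded probability vectors, and the Jeffrey (symmetric KL) divergence used as a regularizer. The incidence matrix B, edge weights w, and the regularization weight 0.1 are illustrative assumptions, not values from the paper.

import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax: the probability model mapping encoder outputs to a simplex.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tree_wasserstein(mu, nu, B, w):
    # 1-Wasserstein distance on a fixed tree between leaf distributions mu and nu:
    #   TWD(mu, nu) = || diag(w) B (mu - nu) ||_1
    # B : (n_edges, n_leaves) 0/1 matrix, B[e, i] = 1 if leaf i lies below edge e.
    # w : (n_edges,) nonnegative edge weights.
    # For a star tree (B = identity, w = 0.5) this reduces to total variation.
    return np.abs((w[:, None] * B) @ (mu - nu)).sum()

def jeffrey_divergence(mu, nu, eps=1e-12):
    # Symmetric KL (Jeffrey) divergence, KL(mu||nu) + KL(nu||mu); eps avoids log(0).
    mu, nu = mu + eps, nu + eps
    return np.sum(mu * np.log(mu / nu)) + np.sum(nu * np.log(nu / mu))

# Toy usage: two augmented views embedded with a softmax probability model.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=8), rng.normal(size=8)   # hypothetical encoder outputs
mu, nu = softmax(z1), softmax(z2)
B, w = np.eye(8), np.full(8, 0.5)                 # star tree -> total variation case
loss = tree_wasserstein(mu, nu, B, w) + 0.1 * jeffrey_divergence(mu, nu)
print(loss)

In an actual SSL pipeline this per-pair loss would be plugged into a SimCLR- or SimSiam-style objective over batches of augmented views; the sketch only illustrates the distance and the regularizer themselves.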

