Generating Synthetic Datasets by Interpolating along Generalized Geodesics (2306.06866v1)

Published 12 Jun 2023 in cs.LG and cs.AI

Abstract: Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain, where the model will ultimately be used, is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as "combinations" of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and, notably, can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate that this is a promising new approach for targeted on-demand dataset synthesis.
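
For orientation, a generalized geodesic interpolates several measures through a common base measure. A standard formulation from the OT literature the abstract invokes (the notation below is ours, not quoted from the paper) is: given a base measure $\rho$ and datasets $\mu_1,\dots,\mu_k$ with optimal transport maps $T_i$ pushing $\rho$ onto $\mu_i$, the generalized geodesic with weights $\lambda$ on the simplex $\Delta_k$ is

$$
\mu_\lambda \;=\; \Big(\textstyle\sum_{i=1}^{k} \lambda_i\, T_i\Big)_{\#}\, \rho,
\qquad \lambda \in \Delta_k,\quad (T_i)_{\#}\rho = \mu_i,
$$

which recovers each $\mu_i$ at the corresponding vertex of the simplex and, for $k=2$, reduces to the classical generalized geodesic between two measures based at $\rho$.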

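The barycentric-projection variant of this construction can be sketched in a few lines. The snippet below is a minimal illustration, assuming the POT library (`pip install pot`) and toy Gaussian point clouds; it interpolates raw features only, and deliberately omits the labeled-dataset (OTDD) ground cost and neural OT maps the paper actually uses.

```python
# Minimal sketch of generalized-geodesic interpolation via barycentric
# projections, using the POT library. This illustrates the general
# construction, NOT the paper's implementation: features only, no
# labeled-dataset (OTDD) cost, no neural OT maps.
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)

def barycentric_map(X_base, X_target):
    """Estimate an OT map from X_base to X_target via the barycentric
    projection of the discrete optimal coupling."""
    n, m = len(X_base), len(X_target)
    a, b = ot.unif(n), ot.unif(m)    # uniform weights on both clouds
    M = ot.dist(X_base, X_target)    # squared Euclidean cost matrix
    P = ot.emd(a, b, M)              # exact optimal coupling (n x m)
    # Barycentric projection: row-normalized coupling applied to targets.
    return (P @ X_target) / a[:, None]

# Toy data: a base (target-domain) cloud and two source datasets.
X_base = rng.normal(size=(200, 2))
X_1 = rng.normal(loc=[4.0, 0.0], size=(300, 2))
X_2 = rng.normal(loc=[0.0, 4.0], size=(250, 2))

T1 = barycentric_map(X_base, X_1)
T2 = barycentric_map(X_base, X_2)

# Generalized geodesic: convex combination of the estimated maps,
# pushed forward from the base points. lam = (0.3, 0.7) is arbitrary.
lam = (0.3, 0.7)
X_interp = lam[0] * T1 + lam[1] * T2
print(X_interp.shape)  # (200, 2): one synthetic point per base point
```

Sweeping the weights over the simplex traces out a family of synthetic datasets, from which the one most similar to the target domain would then be selected.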