Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing (2407.15580v3)

Published 22 Jul 2024 in cs.LG, cs.SD, eess.AS, math.PR, and stat.ML

Abstract: We introduce Annealed Multiple Choice Learning (aMCL), which combines simulated annealing with MCL. MCL is a learning framework for handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, due to the greedy nature of WTA, this scheme may converge toward an arbitrarily suboptimal local minimum. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model's training trajectory. Additionally, we validate our algorithm through extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
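For intuition, here is a minimal sketch of how a hard Winner-takes-all assignment can be softened with annealing: each hypothesis is weighted by a Boltzmann distribution over its error at temperature T, so every hypothesis receives gradient at high temperature (exploration), and only the winner does as T approaches 0 (standard WTA). This is an illustration in PyTorch under stated assumptions (squared-error distortion, detached assignment weights, a geometric cooling schedule, and the hypothetical names `amcl_loss`, `T0`, `decay`); it is not the paper's exact loss or schedule.

```python
import torch

def amcl_loss(hypotheses, target, temperature):
    """Soft (annealed) Winner-takes-all loss: a minimal sketch.

    hypotheses:  (n, d) tensor, the n predicted hypotheses
    target:      (d,) tensor, one ground-truth sample
    temperature: annealing temperature; hard WTA is recovered as T -> 0
    """
    # Squared-error distortion of each hypothesis (an assumption; the
    # distortion could be task-specific).
    errors = ((hypotheses - target) ** 2).sum(dim=-1)             # (n,)
    # Boltzmann soft assignment of the target to the hypotheses,
    # detached so gradients flow only through the distortions.
    weights = torch.softmax(-errors.detach() / temperature, dim=0)
    # High T: near-uniform weights update every hypothesis (exploration).
    # Low T: weights collapse onto the winner (greedy WTA).
    return (weights * errors).sum()

# Toy usage: directly optimize n = 5 hypotheses in R^2 with a geometric
# cooling schedule (a common simulated-annealing choice, assumed here).
heads = torch.randn(5, 2, requires_grad=True)
target = torch.tensor([0.3, -0.7])
T0, decay = 1.0, 0.995
for step in range(1000):
    T = max(T0 * decay ** step, 1e-3)   # temperature floor for stability
    loss = amcl_loss(heads, target, T)
    loss.backward()
    with torch.no_grad():
        heads -= 0.05 * heads.grad      # plain gradient step
        heads.grad.zero_()
```

In a full MCL setting, the hypotheses would be the outputs of a multi-head network conditioned on the input and the loss would be averaged over a dataset; the sketch above only isolates the annealed assignment step.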
