
ResBit: Residual Bit Vector for Categorical Values (2309.17196v4)

Published 29 Sep 2023 in cs.LG

Abstract: One-hot vectors are widely used in machine learning to represent discrete/categorical data because of their simplicity and intuitiveness. However, one-hot vectors suffer from a linear increase in dimensionality, posing computational and memory challenges, especially for datasets containing numerous categories. In this paper, we focus on tabular data generation and reveal that multinomial diffusion suffers from mode collapse when the cardinality is high. Moreover, due to the limitations of one-hot vectors, training takes longer in such situations. To address these issues, we propose Residual Bit Vectors (ResBit), a technique for densely representing categorical data. ResBit extends analog bits and overcomes their limitations when applied to tabular data generation. Our experiments demonstrate that ResBit not only accelerates training but also maintains performance comparable to that obtained before applying it. Furthermore, our results indicate that many existing methods struggle with high-cardinality data, underscoring the need for lower-dimensional representations such as ResBit and latent vectors.

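ResBit is described as an extension of analog bits, which replace a one-hot vector with the binary expansion of the category ID mapped to real values in {-1, +1}. The following is a minimal sketch of that underlying analog-bits encoding only, not the authors' ResBit implementation; the function names are illustrative:

```python
def int_to_analog_bits(x, num_bits):
    # Binary expansion of the category ID (little-endian),
    # with each bit in {0, 1} mapped to {-1.0, +1.0}.
    return [2.0 * ((x >> i) & 1) - 1.0 for i in range(num_bits)]

def analog_bits_to_int(bits):
    # Decode by thresholding each analog bit at 0
    # and reassembling the integer.
    return sum(1 << i for i, b in enumerate(bits) if b > 0)

# A 16-category variable needs only 4 analog bits
# instead of a 16-dimensional one-hot vector.
encoded = int_to_analog_bits(5, num_bits=4)   # [1.0, -1.0, 1.0, -1.0]
decoded = analog_bits_to_int(encoded)         # 5
```

With b bits this covers 2^b categories, so the representation grows logarithmically rather than linearly with cardinality, which is the dimensionality saving the abstract refers to.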
