Generating Multidimensional Clusters With Support Lines (2301.10327v3)

Published 24 Jan 2023 in cs.LG, cs.CV, and cs.PL

Abstract: Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for more complete coverage of a given problem's space. In turn, synthetic data generators can create vast amounts of data, a crucial capability when real-world data is at a premium, while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
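
To make the idea of a cluster "supported by a line segment" concrete, here is a minimal sketch in plain Python. It is not the Clugen implementation or its API: the function name `line_supported_cluster`, its parameters, and the choice of a uniform distribution along the segment with Gaussian offsets perpendicular to it are all assumptions made for illustration. The abstract states only that arbitrary distributions can be used.

```python
import math
import random


def line_supported_cluster(center, direction, length, lateral_std, n_points, seed=0):
    """Sample points scattered around a line segment in any number of dimensions.

    Hypothetical helper for illustration only; not part of the Clugen API.
    """
    rng = random.Random(seed)
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]

    points = []
    for _ in range(n_points):
        # Position along the support segment (assumed uniform over its length).
        t = rng.uniform(-length / 2, length / 2)
        # Gaussian offset with its component along the line removed, so the
        # offset is perpendicular to the support line.
        noise = [rng.gauss(0.0, lateral_std) for _ in unit]
        along = sum(n * u for n, u in zip(noise, unit))
        offset = [n - along * u for n, u in zip(noise, unit)]
        points.append([c + t * u + o for c, u, o in zip(center, unit, offset)])
    return points


# Example: 200 points in 3D around a segment of length 10 along (1, 1, 0).
cluster = line_supported_cluster([0.0, 0.0, 0.0], [1.0, 1.0, 0.0], 10.0, 0.5, 200)
```

Swapping `rng.uniform` or `rng.gauss` for another sampler changes the shape of the cluster without touching the rest of the procedure, which is one way to read the "modular" and "arbitrary distributions" claims in the abstract.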

