
Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces (2306.15392v1)

Published 27 Jun 2023 in cs.LG and cs.AI

Abstract: In this paper, we delve into the critical aspect of dataset quality assessment in machine learning classification tasks. Leveraging a variety of nine distinct datasets, each crafted for classification tasks with varying complexity levels, we illustrate the profound impact of dataset quality on model training and performance. We further introduce two additional datasets designed to represent specific data conditions - one maximizing entropy and the other demonstrating high redundancy. Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality in achieving high-performing machine learning models. To aid researchers and practitioners, we propose a comprehensive framework for dataset quality assessment, which can help evaluate if the dataset at hand is sufficient and of the required quality for specific tasks. This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.

Authors (2)
  1. Szymon Mazurek
  2. Maciej Wielgosz
