Selection via Proxy: Efficient Data Selection for Deep Learning (1906.11829v4)

Published 26 Jun 2019 in cs.LG and stat.ML

Abstract: Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data selection (e.g., selecting data points to label for active learning). By removing hidden layers from the target model, using smaller architectures, and training for fewer epochs, we create proxies that are an order of magnitude faster to train. Although these small proxy models have higher error rates, we find that they empirically provide useful signals for data selection. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. For active learning, applying SVP can give an order of magnitude improvement in data selection runtime (i.e., the time it takes to repeatedly train and select points) without significantly increasing the final error (often within 0.1%). For core-set selection on CIFAR10, proxies that are over 10x faster to train than their larger, more accurate targets can remove up to 50% of the data without harming the final accuracy of the target, leading to a 1.6x end-to-end training time improvement.

Efficient Data Selection for Deep Learning: A Proxy-Based Approach

The paper "Selection via Proxy: Efficient Data Selection for Deep Learning" presents an innovative approach, dubbed Selection via Proxy (SVP), for enhancing computational efficiency in data selection processes within deep learning frameworks. By leveraging small proxy models to perform data selection tasks, such as active learning and core-set selection, the SVP approach significantly reduces computation time while maintaining model accuracy comparable to substantially larger models.

Summary of Key Contributions

Active learning and core-set selection are pivotal methods for identifying the most informative subsets of large datasets, improving the data efficiency of machine learning models. However, applying them to deep learning has been hindered by substantial computational cost, primarily because they depend on feature representations that must first be learned by a computationally expensive target model. The SVP approach addresses this challenge by using small proxy models to drive data selection instead.

Selection via Proxy Methodology:

  1. Proxy Models: These are smaller, computationally inexpensive models obtained by removing hidden layers from the target, using simpler architectures, and training for fewer epochs. Despite their higher error rates, these proxies still provide useful signals for data selection.
  2. Application to Tasks: SVP is applied to data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. Across these benchmarks, the approach proves effective for both active learning and core-set selection without a significant increase in final error (a minimal sketch of the proxy-based selection step follows this list).
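
In the paper's active-learning experiments, uncertainty metrics such as max entropy are computed on the cheap proxy rather than on the target. The following is a minimal PyTorch sketch of that selection step; the function name max_entropy_select and its arguments are illustrative, and it assumes a trained proxy classifier and an unshuffled DataLoader over the unlabeled pool.

```python
import torch
import torch.nn.functional as F

def max_entropy_select(proxy, unlabeled_loader, budget, device="cpu"):
    """Score every unlabeled example by the proxy's predictive entropy
    and return the indices of the `budget` most uncertain points."""
    proxy.eval()
    scores = []
    with torch.no_grad():
        for inputs, _ in unlabeled_loader:
            probs = F.softmax(proxy(inputs.to(device)), dim=1)
            # Predictive entropy: -sum_c p_c * log(p_c); higher means the
            # proxy is less certain, so the point is more informative.
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            scores.append(entropy.cpu())
    scores = torch.cat(scores)
    # Indices are positions in the unlabeled pool, so the loader
    # must iterate in a fixed order (no shuffling).
    return scores.topk(budget).indices.tolist()
```

The selected points are then labeled and added to the training set; the expensive target model only needs to be trained once, at the end, on the final labeled set.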

Empirical Evidence:

  • For active learning, SVP reduces data selection runtime (the time to repeatedly train and select points) by an order of magnitude while often keeping the increase in final error within 0.1%.
  • In core-set selection on CIFAR10, proxy models that are over 10x faster to train can filter out up to 50% of the data without degrading the final accuracy of the target model, yielding a 1.6x improvement in end-to-end training time (see the core-set sketch below).
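
One family of core-set methods the paper builds on is greedy k-centers over learned feature representations; under SVP, those features come from the small proxy instead of the target. Below is a minimal, illustrative implementation assuming the proxy's penultimate-layer features have already been extracted into an n x d tensor; k_center_greedy and its parameters are assumptions for illustration, not the paper's exact code.

```python
import torch

def k_center_greedy(features, budget, first=0):
    """Greedy k-centers over proxy feature vectors (an n x d tensor).
    Repeatedly add the point farthest from the current set of centers,
    so the selected subset covers the feature space of the full data."""
    selected = [first]
    # Distance from every point to its nearest chosen center so far.
    min_dists = torch.cdist(features, features[first].unsqueeze(0)).squeeze(1)
    for _ in range(budget - 1):
        nxt = int(torch.argmax(min_dists))  # farthest remaining point
        selected.append(nxt)
        new_dists = torch.cdist(features, features[nxt].unsqueeze(0)).squeeze(1)
        min_dists = torch.minimum(min_dists, new_dists)
    return selected
```

The returned indices define the retained subset; training the target only on this subset is what yields the reported 1.6x end-to-end speedup when half of CIFAR10 is removed.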

Implications and Future Directions

The results presented in this work provide substantial evidence for the utility of proxy models in deep learning data selection. A major practical implication of SVP is its potential to reduce computational costs dramatically, which is particularly valuable in settings with limited computing resources. Theoretically, the work sharpens our understanding of how learned representations, even from models with markedly higher error rates, can still supply useful signals for data selection in complex deep learning tasks.

Looking forward, further work could explore alternative architectures and configurations for proxy models to refine and extend the applicability of SVP. Extending the framework beyond classification to other machine learning domains could further establish its generality. Integrating SVP with different data augmentation and regularization techniques might also improve the adaptability and generalization of models trained on efficiently selected data.

In conclusion, SVP is a promising step toward making deep learning more accessible and efficient, particularly for real-time and resource-constrained applications.

Authors (8)
  1. Cody Coleman (10 papers)
  2. Christopher Yeh (9 papers)
  3. Stephen Mussmann (15 papers)
  4. Baharan Mirzasoleiman (51 papers)
  5. Peter Bailis (44 papers)
  6. Percy Liang (239 papers)
  7. Jure Leskovec (233 papers)
  8. Matei Zaharia (101 papers)
Citations (296)