FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs (1604.06414v4)

Published 21 Apr 2016 in cs.DC

Abstract: R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and automatically scales the code beyond memory capacity by using solid-state drives (SSDs). It provides a small number of generalized operations (GenOps), on top of which we reimplement a large number of matrix functions in the R base package. As a result, FlashR parallelizes and scales existing R code with little or no modification. To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. We evaluate FlashR on a variety of machine learning and statistics algorithms on inputs of up to four billion data points. Out-of-core FlashR closely tracks the performance of in-memory FlashR. The R code for machine learning algorithms executed in FlashR outperforms the in-memory execution of H2O and Spark MLlib by a factor of 2-10 and outperforms Revolution R Open by more than an order of magnitude.
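
The lazy evaluation and operation fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not FlashR's API or implementation: the class `LazyVector`, its methods `map` and `sum`, and the `chunk_size` parameter are hypothetical names invented for this example, it is written in Python rather than R, and a plain in-memory list stands in for data that would live on SSDs. The idea it shows is that element-wise operations are only recorded, and all of them are applied together in one pass over each partition when a result is finally requested.

```python
from typing import Callable, Iterator, List, Optional


class LazyVector:
    """Hypothetical lazy vector: records element-wise ops instead of running them."""

    def __init__(self, source: List[float],
                 ops: Optional[List[Callable[[float], float]]] = None):
        self.source = source   # assumption: stands in for data stored on SSD
        self.ops = ops or []   # pending element-wise functions, in order

    def map(self, fn: Callable[[float], float]) -> "LazyVector":
        # No data is touched here; a new object carries the extended pipeline.
        return LazyVector(self.source, self.ops + [fn])

    def _chunks(self, chunk_size: int) -> Iterator[List[float]]:
        # Yield the data one partition at a time.
        for start in range(0, len(self.source), chunk_size):
            yield self.source[start:start + chunk_size]

    def sum(self, chunk_size: int = 4) -> float:
        # Materialization point: a single pass over the data in which every
        # recorded op is applied to an element before moving to the next one,
        # so no intermediate vector is ever written out.
        total = 0.0
        for chunk in self._chunks(chunk_size):
            for x in chunk:
                for fn in self.ops:
                    x = fn(x)
                total += x
        return total


v = LazyVector([float(i) for i in range(1, 9)])
pipeline = v.map(lambda x: x * 2).map(lambda x: x + 1)  # nothing computed yet
result = pipeline.sum()                                 # one fused pass
```

In an eager design, `x * 2` would produce a full intermediate vector before `x + 1` ran, which means reading and writing the data twice. Deferring the work until `sum` lets both operations run while each partition is resident, which is the kind of data-movement saving the abstract describes.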

Citations (6)

Authors (5)
