SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems (1903.03129v2)

Published 7 Mar 2019 in cs.DC and cs.LG

Abstract: Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, specialized hardware is expensive and hard to generalize to a multitude of tasks. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, with multi-core parallelism and workload optimization. Using just a CPU, SLIDE drastically reduces the computations during both training and inference outperforming an optimized implementation of Tensorflow (TF) on the best available GPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44 core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on Tesla V100 at any given accuracy level. On the same CPU hardware, SLIDE is over 10x faster than TF. We provide codes and scripts for reproducibility.

Citations (99)

Summary

  • The paper presents SLIDE, an algorithmic approach that uses multi-core CPU parallelism and adaptive LSH-based sparsification to achieve over 3.5x training speedup compared to optimized GPU implementations.
  • The paper introduces an innovative adaptive dropout technique with intelligent neuron sampling that maintains model accuracy while significantly reducing computation.
  • The research highlights practical scalability and memory optimizations that democratize deep learning by reducing reliance on expensive specialized hardware.

Overview of SLIDE: In Defense of Smart Algorithms Over Hardware Acceleration for Large-Scale Deep Learning Systems

This paper introduces a novel approach to deep learning, proposing the Sub-LInear Deep learning Engine (SLIDE) as a compelling alternative to conventional hardware acceleration. The authors posit that while the deep learning community has heavily invested in specialized hardware to handle the immense computational demands of neural networks, such approaches can be costly and lack flexibility. SLIDE instead leverages randomized algorithmic strategies on commodity CPUs to outperform hardware-accelerated training on GPUs such as NVIDIA's Tesla V100.

Key Contributions

The primary contributions of this research include:

  1. Demonstrating that SLIDE, using multi-core parallelism on standard CPUs, can surpass state-of-the-art GPUs in both training speed and inference efficiency. The authors present empirical evidence that SLIDE trains faster without sacrificing accuracy, reaching a more than 3.5x speedup over an optimized TensorFlow implementation running on a Tesla V100 GPU.
  2. Developing a coherent system in which Locality Sensitive Hashing (LSH) drives adaptive neuron sparsification, a technique that intelligently reduces computational overhead without compromising model convergence or final accuracy (see the sketch after this list).
  3. Offering a comprehensive evaluation across large-scale recommendation systems, underlining SLIDE's capacity to handle extreme classification tasks characterized by a vast number of classes and exceedingly large fully connected layers.
  4. Providing insights into the memory management and workload optimization aspects necessary to enhance SLIDE's implementation. The paper highlights optimizations like the use of Transparent Hugepages and SIMD instructions, which further increase SLIDE's performance.
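
To make the LSH-based sparsification concrete: the idea is to build hash tables over each layer's neuron weight vectors so that an input activation can later be used as a query to retrieve neurons likely to produce large activations. Below is a minimal Python sketch assuming a SimHash-style (signed random projection) hash family; the class and function names are illustrative and do not correspond to SLIDE's actual C++ implementation.

```python
import numpy as np

class SimHashTable:
    """One LSH table: buckets neuron ids by the sign pattern of K random projections."""
    def __init__(self, dim, k, seed=0):
        rng = np.random.default_rng(seed)
        self.projections = rng.standard_normal((k, dim))  # K random hyperplanes
        self.buckets = {}                                 # signature -> list of neuron ids

    def signature(self, vec):
        # K-bit signature: which side of each hyperplane the vector falls on
        return tuple((self.projections @ vec > 0).astype(np.int8))

    def insert(self, neuron_id, weight_vec):
        self.buckets.setdefault(self.signature(weight_vec), []).append(neuron_id)

    def query(self, input_vec):
        # Neurons whose weight vectors hash to the same bucket as the query input
        return self.buckets.get(self.signature(input_vec), [])

def build_tables(W, num_tables=8, k=6):
    """Build L independent tables over a layer's weight matrix W (n_neurons x dim)."""
    tables = [SimHashTable(W.shape[1], k, seed=t) for t in range(num_tables)]
    for table in tables:
        for nid, w in enumerate(W):
            table.insert(nid, w)
    return tables
```

Because weight vectors that collide with an input under random projections tend to have large inner products with it, querying these tables is a cheap stand-in for scanning every neuron in the layer.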

Algorithmic Insights

SLIDE’s innovation lies in its use of adaptive dropout techniques combined with LSH, providing a substantial reduction in the number of active neurons processed during training. This is crucial for managing the workload efficiently on CPUs, and it contrasts with the GPU approach, which relies on massive parallelism to compute every neuron regardless of its relevance to the current input.
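
The sketch below, which continues the hash-table example above (again with hypothetical names rather than SLIDE's real interface), shows how a forward pass can be restricted to the neurons retrieved from the LSH tables, so the cost scales with the size of the active set instead of the full layer width.

```python
import numpy as np

def sparse_forward(x, W, b, tables):
    """Compute activations only for neurons retrieved by the LSH tables.

    Neurons whose weight vectors collide with the input x are likely to have
    large inner products with it, so restricting the forward pass to this
    active set approximates the full layer at a fraction of the cost.
    """
    active = set()
    for table in tables:
        active.update(table.query(x))
    active = sorted(active)
    if not active:  # fall back to a small random sample if every table misses
        active = list(np.random.choice(len(W), size=min(16, len(W)), replace=False))
    z = W[active] @ x + b[active]         # only |active| dot products, not n_neurons
    return active, np.maximum(z, 0.0)     # ReLU over the sampled neurons only
```

Backpropagation then touches only the same active set, which is what makes per-example updates cheap enough to run on a CPU.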

Moreover, SLIDE's tolerance for asynchronous updates allows it to thrive under HOGWILD!-style parallelism, yielding near-ideal scaling with the number of CPU cores, something frameworks like TensorFlow on CPU struggle to achieve because of their reliance on synchronized parameter updates.
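
SLIDE implements this pattern in C++ with OpenMP threads; the toy Python sketch below only illustrates the structure of HOGWILD!-style training, where worker threads write their sparse gradients into shared parameters without locks. The `grad_fn` callback is a hypothetical stand-in for per-example sparse backpropagation and is not part of the paper's code.

```python
import threading
import numpy as np

def hogwild_train(W, batches, grad_fn, lr=0.01, num_threads=4):
    """HOGWILD!-style training sketch: each thread applies its gradients to the
    shared weight matrix W in place, without any locking. Because each sparse
    update touches only a small set of active neurons, collisions between
    threads are rare and occasional overwrites barely affect convergence."""
    def worker(shard):
        for x, y in shard:
            active, grad = grad_fn(W, x, y)   # gradient only for the active neurons
            W[active] -= lr * grad            # lock-free in-place update of shared W
    shards = [batches[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```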

Experimental Results

The experimental results reveal that SLIDE significantly improves training times on industry-scale datasets compared to both CPU- and GPU-based TensorFlow baselines. This performance gain is due in part to intelligent neuron sampling that drastically reduces computational redundancy. Notably, SLIDE achieves this while maintaining, and at times improving, convergence behavior and model accuracy.

Implications and Future Work

This work suggests that the machine learning community should reevaluate its emphasis on hardware-centric approaches in favor of innovative algorithmic frameworks. The practicality of deploying deep learning systems without dependence on specialized hardware could democratize access to these models, providing scalability and economic efficiency.

Looking forward, there's a clear path for this work to expand into more complex architectures, including those with convolutional elements, where the spatial and structural characteristics of data could benefit further from SLIDE's sparsification techniques. Additionally, distributed implementations could leverage SLIDE's strengths, particularly in environments where communication costs are a concern.

Conclusion

By successfully challenging the prevailing narrative that specialized hardware is necessary for efficient large-scale deep learning, this paper offers a bold algorithmic perspective. SLIDE demonstrates that with the right blend of data structures and adaptive algorithms, significant computational efficiency can be achieved with existing general-purpose hardware, representing a pivotal contribution to the field of scalable machine learning systems.
