Predicting the Computational Cost of Deep Learning Models (1811.11880v1)

Published 28 Nov 2018 in cs.LG, cs.AI, and stat.ML

Abstract: Deep learning is rapidly becoming a go-to tool for many artificial intelligence problems due to its ability to outperform other approaches and even humans at many problems. Despite its popularity, we are still unable to accurately predict the time it will take to train a deep learning network to solve a given problem. This training time can be seen as the product of the training time per epoch and the number of epochs which need to be performed to reach the desired level of accuracy. Some work has been carried out to predict the training time for an epoch -- most have been based around the assumption that the training time is linearly related to the number of floating point operations required. However, this relationship does not hold in practice, and the discrepancy is exacerbated in cases where other activities start to dominate the execution time, such as the time to load data from memory or loss of performance due to non-optimal parallel execution. In this work we propose an alternative approach in which we train a deep learning network to predict the execution time for parts of a deep learning network. Timings for these individual parts can then be combined to provide a prediction for the whole execution time. This has advantages over linear approaches as it can model more complex scenarios. It also has the ability to predict execution times for scenarios unseen in the training data. Therefore, our approach can be used not only to infer the execution time for a batch, or entire epoch, but it can also support making a well-informed choice for the appropriate hardware and model.

Citations (207)

Summary

  • The paper introduces a deep learning-based predictor that estimates training times using per-layer decomposition and comprehensive hardware features.
  • It employs a fully connected neural network to capture non-linear interactions that traditional linear models overlook.
  • Experiments across multiple GPUs demonstrate reduced prediction errors, enabling more efficient resource allocation in deep learning projects.

Predicting the Computational Cost of Deep Learning Models

The paper "Predicting the Computational Cost of Deep Learning Models" addresses a critical yet often overlooked aspect of deep learning: the computational cost associated with training these models. While deep learning models demonstrate exceptional performance on a variety of tasks, the computational resources required for training are substantial, impacting financial and time budgets significantly. Despite the advances in hardware and distributed computing, there exists a notable challenge in accurately predicting the training time for deep learning networks, thus impeding efficient resource allocation and cost management.

Key Contributions

This work proposes an innovative method where a deep learning model itself is used to predict the execution time of training other deep learning networks. This model is trained on a wealth of features extracted from the computational resources and the specific network and data configurations being used. The proposed approach provides several advantages over traditional linear models, which tend to rely on floating-point operation counts as the primary determinant of execution time. The authors argue that such linear models fall short, as they fail to consider non-linear interactions such as those arising from memory bandwidth limitations or non-ideal parallel execution.
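To make the contrast concrete, the sketch below shows the kind of FLOP-count baseline the paper argues against: estimate a layer's floating-point operations and fit a single linear relationship between FLOPs and measured time. The FLOP formulas, timings, and fitted constants here are illustrative placeholders, not values from the paper.

```python
import numpy as np

def dense_flops(batch, n_in, n_out):
    # Multiply-accumulate count for a fully connected layer (illustrative).
    return 2 * batch * n_in * n_out

def conv2d_flops(batch, h, w, c_in, c_out, k):
    # FLOPs for a stride-1, same-padded 2D convolution (illustrative).
    return 2 * batch * h * w * c_in * c_out * k * k

# Hypothetical measured (FLOPs, seconds) pairs for a single GPU.
flops = np.array([
    dense_flops(32, 1024, 1024),
    conv2d_flops(32, 56, 56, 64, 64, 3),
    conv2d_flops(32, 28, 28, 128, 128, 3),
], dtype=float)
seconds = np.array([0.8e-3, 6.5e-3, 3.4e-3])  # made-up timings

# The linear baseline: time ~ a * FLOPs + b, blind to memory bandwidth
# limits or parallel efficiency.
a, b = np.polyfit(flops, seconds, deg=1)
predicted = a * flops + b
```

Such a baseline necessarily assigns the same cost to any two layers with equal FLOP counts, which is precisely the limitation the learned predictor is designed to overcome.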

Methodology

The authors decompose the execution time prediction into predictions for individual components (typically layers) and recombine these per-layer predictions to estimate the total execution time for a training epoch. This method accounts for the non-linear characteristics of hardware-software interaction through a fully connected feed-forward neural network that learns these interactions from training data when predicting per-layer execution times. Key features include layer-specific characteristics (e.g., number of inputs and outputs, activation functions), hardware characteristics (e.g., GPU architecture, memory bandwidth), and training dynamics (e.g., batch size, optimizer).
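The following is a minimal sketch, under assumed feature choices, of how such a per-layer predictor can be wired together: each layer is described by a vector of layer, hardware, and training-dynamics features, a fully connected regressor maps that vector to a per-batch time, and the per-layer predictions are summed into an epoch estimate. The specific feature names, network sizes, and use of scikit-learn are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def layer_features(layer, hardware, training):
    # Concatenate layer-specific, hardware, and training-dynamics features
    # into one vector (the feature set here is illustrative).
    return np.array([
        layer["n_inputs"], layer["n_outputs"], layer["uses_relu"],
        hardware["memory_bandwidth_gb_s"], hardware["fp32_tflops"],
        training["batch_size"], training["uses_sgd"],
    ], dtype=float)

# X: per-layer feature vectors, y: measured per-layer times (seconds),
# both obtained by benchmarking individual layers; placeholders shown here.
X = np.random.rand(1000, 7)
y = np.random.rand(1000) * 1e-2

# Fully connected feed-forward regressor for per-layer execution time.
model = MLPRegressor(hidden_layer_sizes=(128, 64), activation="relu", max_iter=500)
model.fit(X, y)

def predict_epoch_time(layers, hardware, training, batches_per_epoch):
    feats = np.stack([layer_features(l, hardware, training) for l in layers])
    per_layer = model.predict(feats)            # predicted time per layer, per batch
    return per_layer.sum() * batches_per_epoch  # recombine into an epoch estimate
```

Because the predictor operates on individual layers rather than whole networks, it can estimate the cost of architectures it has never seen, provided their layers resemble those in the benchmark data.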

Evaluation

Experiments are conducted on various GPUs, encompassing architectures like NVIDIA’s V100, P100, and K80, among others. The paper provides empirical evidence that their model not only predicts execution time with lower error margins compared to linear regression models but also generalizes to unseen hardware configurations. The model achieves this by incorporating various hardware-specific features into the prediction framework, thereby capturing subtleties that linear models cannot.
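A simple way to mirror the flavor of this comparison is to hold out all measurements from one GPU, train both a linear-regression baseline and the fully connected predictor on the remaining hardware, and compare their errors on the held-out device. The leave-one-GPU-out split, error metric, and placeholder data below are assumptions for illustration; they do not reproduce the paper's protocol or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

def mape(y_true, y_pred):
    # Mean absolute percentage error, a common metric for timing prediction.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Placeholder benchmark data: one row per layer run, with a GPU label per row.
X = np.random.rand(2000, 7)
y = np.random.rand(2000) * 1e-2 + 1e-4
gpu = np.random.choice(["V100", "P100", "K80"], size=2000)

# Leave-one-GPU-out split to probe generalization to unseen hardware.
held_out = "K80"
train, test = gpu != held_out, gpu == held_out

linear = LinearRegression().fit(X[train], y[train])
mlp = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500).fit(X[train], y[train])

print("linear MAPE:", mape(y[test], linear.predict(X[test])))
print("MLP MAPE:   ", mape(y[test], mlp.predict(X[test])))
```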

Implications and Future Work

The immediate practical implication of this work is its ability to inform more cost-effective resource selections and optimized scheduling in cloud environments, where the financial implications of training are particularly pronounced. On a theoretical level, this model presents a nuanced understanding of hardware-software interactions, stressing the need for holistic models that consider architecture-specific intricacies and diverse training environments.

Looking ahead, further training on a wider array of GPUs, combined with a richer feature space, is likely to enhance the model's predictive power and generalization ability. Extending the approach to accommodate other machine learning frameworks, such as PyTorch and JAX, and incorporating support for inference-phase predictions could prove beneficial. Given the modularity and extensibility noted in their approach, there remains a clear path for open collaboration, allowing the method to evolve as more data and diverse hardware configurations are collected and incorporated. As the ecosystem around deep learning hardware continues to grow, such predictive models could become central in balancing performance with cost, optimizing AI deployments significantly.