
XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments

Published 19 Dec 2022 in cs.LG and cs.PL | (2212.09290v1)

Abstract: Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. This allows, in particular, for resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, the definition of these checkpoints is a non-trivial problem and poses a challenge to the programmer - improper or excessive recomputations negate the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate, a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5 % faster than the fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find valid schedules for networks making use of both central processing units and graphics processing units if memory limitations do not allow scheduling exclusively to the graphics processing unit.

Summary

  • The paper introduces XEngine, which leverages MIQP to efficiently schedule tensor rematerialization across CPUs and GPUs.
  • The framework details a novel operator scheduling method that reduces memory usage and cuts computation time by up to 22.5% compared to GPU-only approaches.
  • XEngine enhances resource efficiency in deep learning, paving the way for scalable and energy-efficient model training on constrained devices.

Overview

The paper introduces XEngine, a framework for tensor rematerialization that aims for optimal scheduling in heterogeneous environments comprising CPUs and GPUs. This approach is distinguished by its use of mixed-integer quadratic programming (MIQP) to optimize the end-to-end execution time of neural networks (NNs), accommodating constraints on memory and compute resources. Significant improvements over the existing Checkmate framework are demonstrated, particularly under conditions that require dynamic distribution of computational loads across multiple devices.

Framework Implementation

XEngine's core function is to manage tensor rematerialization in resource-constrained environments. Unlike prior methods that focus on a single device, XEngine considers multiple heterogeneous devices. The framework schedules network operators by determining optimal checkpoints for recomputation, reducing memory pressure by recalculating some forward-pass activations when needed for backpropagation.
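The checkpoint-and-recompute idea can be sketched in a few lines of plain Python. This is a toy illustration with made-up operators, not XEngine's implementation: only every k-th activation is kept during the forward pass, and any discarded activation is rematerialized from the nearest saved checkpoint when backpropagation needs it.

```python
def forward_with_checkpoints(ops, x, k=2):
    """Run a chain of ops, saving only every k-th activation."""
    checkpoints = {0: x}          # the input is always kept
    for i, op in enumerate(ops, start=1):
        x = op(x)
        if i % k == 0:
            checkpoints[i] = x
    return x, checkpoints

def rematerialize(ops, checkpoints, i):
    """Recompute activation i from the nearest saved checkpoint."""
    j = max(idx for idx in checkpoints if idx <= i)
    x = checkpoints[j]
    for op in ops[j:i]:
        x = op(x)
    return x

# Four stand-in "operators" (real networks would apply layers here).
ops = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
out, cps = forward_with_checkpoints(ops, 5, k=2)
# Only activations 2 and 4 (plus the input) stay in memory;
# activation 3 is recomputed on demand during backpropagation.
```

The trade-off XEngine optimizes is visible even here: a larger k saves more memory but increases the recomputation distance from the nearest checkpoint.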

  1. Problem Formulation in MIQP:
    • Operators are scheduled with consideration of both memory and compute constraints.
    • The scheduling problem is formalized using MIQP to handle device-specific constraints, unlike the single-device MILP of Checkmate. This allows for simultaneous execution of network operations across multiple devices (e.g., CPU and GPU), a feature not supported by previous frameworks.
  2. Performance Considerations:
    • The MIQP approach allows for a more granular optimization whose objective includes cost terms for both computation and inter-device data transfer.
    • Experiments on test networks like VGG, ResNet, and UNet show up to 22.5% reduction in computation time using CPU/GPU schedules, compared to GPU-only schedules.
  3. Software and Hardware Considerations:
    • The implementation leverages Intel oneDNN for efficient operator execution on Intel hardware, though the underlying concepts are adaptable to other platforms.
    • Resource profiles, including tensor sizes and compute costs, are gathered in an offline step, which allows XEngine to optimize schedules with detailed knowledge of each operation's resource requirements.
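The decision the MIQP formalizes, which device runs each operator given per-device memory limits and inter-device transfer costs, can be sketched as a brute-force search over assignments. All numbers below are hypothetical, and the exhaustive search merely stands in for the MIQP solver the paper actually uses:

```python
from itertools import product

# Per-operator compute time (seconds) on each device; hypothetical values.
compute_cost = {
    "conv1": {"cpu": 4.0, "gpu": 1.0},
    "conv2": {"cpu": 5.0, "gpu": 1.2},
    "fc":    {"cpu": 1.5, "gpu": 0.5},
}
mem_cost = {"conv1": 6, "conv2": 6, "fc": 2}   # memory units per operator
mem_limit = {"cpu": 32, "gpu": 8}              # device memory capacities
transfer_cost = 0.8  # penalty when consecutive ops run on different devices

ops = ["conv1", "conv2", "fc"]
best = None
for assign in product(["cpu", "gpu"], repeat=len(ops)):
    # Enforce each device's memory limit.
    used = {"cpu": 0, "gpu": 0}
    for op, dev in zip(ops, assign):
        used[dev] += mem_cost[op]
    if any(used[d] > mem_limit[d] for d in used):
        continue
    # Objective: total compute time plus inter-device transfer time.
    t = sum(compute_cost[op][dev] for op, dev in zip(ops, assign))
    t += transfer_cost * sum(a != b for a, b in zip(assign, assign[1:]))
    if best is None or t < best[0]:
        best = (t, assign)

print(best)
```

In this toy instance the GPU cannot hold both convolutions, so the cheapest feasible schedule runs `conv1` on the CPU and the rest on the GPU, mirroring how XEngine spills work to the CPU only when GPU memory forces it. The real MIQP also decides recomputation timesteps jointly with this placement, which the sketch omits.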

Experimental Results

  1. Test Networks: Experiments were conducted on different network architectures including VGG16, ResNet18, and UNet, varying in complexity and tensor operations.
  2. Heterogeneous Setup: Results show consistent reductions in computation times by intelligently redistributing computations across CPUs and GPUs, optimizing for constraints specific to each setup.
  3. Memory Utilization: XEngine achieves lower memory usage by incorporating recomputation strategies (i.e., rematerialization), making it possible to fit larger models within constrained resources without significant computational overhead.
  4. Comparison with Checkmate: Against Checkmate, a single-device MILP solution, XEngine offers a distinct advantage, especially for larger networks where exclusive reliance on GPU poses memory limitations.

Implications and Future Work

  1. Broader Applicability: XEngine enables training of deep learning models in memory-constrained environments without sacrificing performance, which is critical for edge computing and deployment in mobile applications.
  2. Energy Efficiency Extensions: The paper hints at future avenues for extending the MIQP framework to incorporate energy efficiency constraints, vital for optimizing deployments in energy-limited environments such as mobile devices.
  3. Scalability and Adaptation: While current implementations focus primarily on static graphs, expanding to adaptive and dynamic graph structures could provide further benefits, enabling real-time adjustments and scalability in variable workloads.

Conclusion

XEngine provides a robust framework for the composite problem of network operator scheduling in heterogeneous environments by leveraging MIQP. Its potential to significantly accelerate neural network training and inference by optimizing both memory consumption and computational distribution presents a valuable advancement in resource-efficient deep learning practices. This approach makes computationally expensive tasks more feasible in constrained environments, offering a pathway to deploy complex models on diverse hardware configurations.
