- The paper’s main contribution is a flexible ML system that expresses computations as stateful dataflow graphs and executes them on heterogeneous hardware.
- It details a novel programming model with directed graphs, automatic gradient computation, and fault-tolerant distributed execution to enhance scalability and performance.
- The paper demonstrates efficiency gains achieved through optimizations such as common subexpression elimination, optimized memory management, and lossy tensor compression.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
The paper "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems" authored by Abadi et al. presents an extensive overview of TensorFlow, an interface and implementation designed for expressing ML algorithms and executing them across various hardware configurations. This work emanates from the experiences and insights gained from Google’s earlier DistBelief system, which undertook scalable distributed training and inference.
Overview of TensorFlow
TensorFlow is introduced as a flexible system that operates across a spectrum of hardware, from mobile devices to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. Its core feature is the ability to express computations as stateful dataflow graphs, which can be executed on heterogeneous systems with little or no modification. This flexibility both simplifies the deployment of real-world ML applications and eases experimentation with new models.
Key Components and Capabilities
- Programming Model:
- Graphs: TensorFlow computations are defined as directed graphs where nodes represent operations and edges denote data flow (tensors).
- Operations and Kernels: Operations are abstract computations (e.g., matrix multiplication), with kernels providing device-specific implementations.
- Sessions and Variables: Sessions execute parts of these computation graphs. Variables allow persistence of state across graph executions, crucial for ML model parameters.
- Gradient Computation: TensorFlow automatically computes the gradients needed by optimization algorithms such as stochastic gradient descent (SGD); a minimal sketch of this programming model appears after this list.
- Implementation Architecture:
- Devices: Computation is performed on devices (CPUs, GPUs), each managed by worker processes.
- Execution: TensorFlow supports both single-device and multi-device (local and distributed) execution. The system partitions computation graphs into subgraphs for each device, handling inter-device communication through Send/Receive nodes.
- Fault Tolerance: Distributed execution includes fault detection, with mechanisms like periodic health-checks and consistent checkpointing/recovery ensuring robustness.
- Extensions and Advanced Features:
- Control Flow: TensorFlow supports conditionals (if-constructs) and iteration (while-loops) directly as graph constructs, enhancing the expressiveness and efficiency of ML models.
- Partial Execution: Arbitrary subgraphs can be executed in isolation, with tensors injected (fed) into and retrieved (fetched) from any edge of the graph.
- Queues: Queues facilitate asynchronous operation execution and data handoff, improving throughput and efficiency in data pipelines.
- Device Placement Constraints: Users can specify device constraints to guide TensorFlow’s internal node placement algorithm; control flow and device placement are both illustrated in a sketch after this list.
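The programming model can be made concrete with a short example. The sketch below is a minimal illustration, not code from the paper: it uses the TensorFlow 1.x-style graph API through the `tensorflow.compat.v1` shim so it runs on current installations, and the shapes, learning rate, and toy regression task are all illustrative assumptions.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # build and run an explicit dataflow graph

# Nodes are operations, edges carry tensors; Variables hold persistent state.
x = tf.placeholder(tf.float32, shape=[None, 4], name="x")   # input tensor
y = tf.placeholder(tf.float32, shape=[None, 1], name="y")   # targets
W = tf.Variable(tf.zeros([4, 1]), name="W")                  # model parameters
b = tf.Variable(tf.zeros([1]), name="b")

pred = tf.matmul(x, W) + b                    # MatMul/Add ops in the graph
loss = tf.reduce_mean(tf.square(pred - y))    # scalar loss node

# Automatic differentiation: the graph is extended with gradient nodes.
grad_W, grad_b = tf.gradients(loss, [W, b])
train_step = [W.assign_sub(0.1 * grad_W), b.assign_sub(0.1 * grad_b)]

with tf.Session() as sess:                    # a Session executes (sub)graphs
    sess.run(tf.global_variables_initializer())
    xs = np.random.rand(32, 4).astype(np.float32)
    ys = np.random.rand(32, 1).astype(np.float32)
    for _ in range(100):
        # Partial execution: only the nodes needed for the fetches run.
        _, l = sess.run([train_step, loss], feed_dict={x: xs, y: ys})
    print("final loss:", l)
```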
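The extensions can likewise be sketched in a few lines. The example below, again an illustrative sketch against the 1.x-style API rather than code from the paper, shows iteration expressed as a graph construct (tf.while_loop) and an explicit device placement constraint; the device string and loop bounds are assumptions, and "/gpu:0" would be substituted where a GPU is available.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Iteration as a graph construct: sum the integers 0..9 with tf.while_loop.
i0 = tf.constant(0)
s0 = tf.constant(0)
cond = lambda i, s: i < 10
body = lambda i, s: [i + 1, s + i]
_, total = tf.while_loop(cond, body, [i0, s0])

# Device placement constraint; TensorFlow inserts Send/Receive nodes for any
# graph edges that cross device boundaries.
with tf.device("/cpu:0"):
    a = tf.ones([1000, 1000])
    b = tf.matmul(a, a)

with tf.Session() as sess:
    print(sess.run(total))        # 45
    print(sess.run(b).shape)      # (1000, 1000)
```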
Optimization Techniques
The paper details several optimizations aimed at enhancing the performance and resource efficiency of TensorFlow:
- Common Subexpression Elimination: Redundancy in computation graphs is minimized by identifying and merging identical sub-expressions.
- Execution and Memory Management: Operation scheduling (e.g., as-soon-as-possible/as-late-as-possible analysis) limits how long intermediate results are held in memory, and kernels build on optimized libraries such as BLAS, cuBLAS, and cuDNN.
- Compression: Lossy compression of tensor values during inter-device communication (e.g., truncating 32-bit floats to a 16-bit representation and zero-filling the dropped bits on the receiving side) reduces data transfer overhead; this is acceptable because neural network training is relatively tolerant of such noise. A sketch of the idea follows this list.
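The compression idea is easy to illustrate. The NumPy sketch below is not TensorFlow’s internal implementation; it only demonstrates the scheme the paper describes, keeping the upper 16 bits of each 32-bit float (a bfloat16-like format) for transfer and restoring a float32 with zeroed low-order bits on the receiving side.

```python
import numpy as np

def compress_to_16bit(x: np.ndarray) -> np.ndarray:
    """Keep only the high-order 16 bits (sign, exponent, 7 mantissa bits)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def decompress_to_float32(hi: np.ndarray) -> np.ndarray:
    """Restore float32 values with the low-order 16 bits zeroed out."""
    return (hi.astype(np.uint32) << 16).view(np.float32)

x = np.random.randn(4).astype(np.float32)
x_hat = decompress_to_float32(compress_to_16bit(x))
print(x)
print(x_hat)                      # close to x; small relative error per value
print(np.max(np.abs(x - x_hat)))  # bounded by the dropped mantissa bits
```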
Practical and Theoretical Implications
Practically, TensorFlow has transformed the landscape of machine learning by offering an open-source, highly scalable platform for both researchers and practitioners. Its broad applicability across devices and its flexibility for model experimentation have fostered significant advances in domains such as computer vision, natural language processing, and reinforcement learning.
Theoretically, TensorFlow’s robust and flexible architecture sets a benchmark for future ML system designs. It bridges high-level model specification with low-level system execution, providing insights into effectively managing heterogeneity and scale in ML computing.
Future Developments
TensorFlow’s architecture paves the way for several prospective enhancements:
- Function Mechanism: Introducing reusable subgraph components across different languages to facilitate cross-language research and application development.
- Just-In-Time Compilation: Developing a JIT compiler for subgraph optimization based on real-time profiling.
- Placement and Scheduling Learning: Implementing learning-based algorithms for node placement and execution scheduling to improve resource utilization and execution time.
Conclusion
The "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems" paper documents a comprehensive system that has revolutionized ML research and deployment. Through its flexible programming model, robust implementation, and thoughtful optimization techniques, TensorFlow has set a high standard in the field of distributed machine learning systems. The ongoing enhancements and open-source community contributions promise continued advancements and adaptations, making TensorFlow a cornerstone in the future development of artificial intelligence technologies.