TensorFlow-Serving: Flexible, High-Performance ML Serving (1712.06139v2)

Published 17 Dec 2017 in cs.DC and cs.LG

Abstract: We describe TensorFlow-Serving, a system to serve machine learning models inside Google which is also available in the cloud and via open-source. It is extremely flexible in terms of the types of ML platforms it supports, and ways to integrate with systems that convey new models and updated versions from training to serving. At the same time, the core code paths around model lookup and inference have been carefully optimized to avoid performance pitfalls observed in naive implementations. Google uses it in many production deployments, including a multi-tenant model hosting service called TFS2.

Authors (9)
  1. Christopher Olston (2 papers)
  2. Noah Fiedel (22 papers)
  3. Kiril Gorovoy (1 paper)
  4. Jeremiah Harmsen (7 papers)
  5. Li Lao (3 papers)
  6. Fangwei Li (1 paper)
  7. Vinu Rajashekhar (2 papers)
  8. Sukriti Ramesh (1 paper)
  9. Jordan Soyke (1 paper)
Citations (284)

Summary

  • The paper presents a modular architecture that integrates flexible APIs with dynamic model lifecycle management for efficient ML deployment.
  • The paper details optimized inference methods using batching, hardware acceleration, and thread isolation to reduce latency and boost throughput.
  • The paper demonstrates practical scalability with robust version management and operational benchmarks validating its production-grade performance.

An Analysis of TensorFlow-Serving: Flexible, High-Performance ML Model Deployment

The research paper presents an in-depth examination of TensorFlow-Serving, a model-serving system designed to deploy ML models at Google and, via open source and a hosted cloud offering, beyond it. TensorFlow-Serving distinguishes itself by its flexibility in supporting various ML platforms and in integrating with the systems that convey new models and updated versions from training to serving. Built with high-performance requirements in mind, its core paths for model lookup and inference have been carefully optimized to avoid the performance pitfalls observed in naive implementations.

Architectural Insights and Core Components

TensorFlow-Serving is structured into three primary layers: a C++ library, a canonical server binary, and a hosted service. The library provides the APIs and modules for constructing an ML server, with a strong emphasis on modularity. This layered architecture facilitates custom configurations and supports different ML models, not exclusively those developed with TensorFlow, but potentially any ML framework through abstraction mechanisms.
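
To make the framework-agnostic abstraction concrete, here is a minimal sketch of a loader interface; the class and method names are hypothetical illustrations of the idea, not the actual TensorFlow-Serving C++ API:

```python
from abc import ABC, abstractmethod
from typing import Any


class Loader(ABC):
    """Hypothetical loader abstraction. The serving core only sees this
    interface, so any ML framework can plug in behind it."""

    @abstractmethod
    def estimate_resources(self) -> int:
        """Estimated cost of loading (e.g. bytes of RAM), used for admission decisions."""

    @abstractmethod
    def load(self) -> Any:
        """Materialize the servable (a real TF loader would restore a SavedModel)."""

    @abstractmethod
    def unload(self) -> None:
        """Release the servable's resources."""


class DummyModelLoader(Loader):
    """Stand-in for a framework-specific loader; only this class would depend
    on TensorFlow (or XGBoost, etc.) in a real system."""

    def __init__(self, export_path: str):
        self.export_path = export_path
        self._model = None

    def estimate_resources(self) -> int:
        return 0  # a real loader would inspect the export on disk

    def load(self) -> Any:
        self._model = {"path": self.export_path}  # placeholder for a restored model
        return self._model

    def unload(self) -> None:
        self._model = None
```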

Model lifecycle management is central to TensorFlow-Serving, encapsulating the discovery, loading, and transition of model versions. It uses a chain of modules (sources, source routers, source adapters, and managers) to manage the model lifecycle, and custom implementations of these modules allow specialization for particular use cases.
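
The following is a simplified sketch of such a chain, assuming hypothetical module names and a trivial "load everything aspired" policy; the real modules are C++ and considerably richer:

```python
from typing import Callable, Dict, List

# Hypothetical module chain: Source -> SourceAdapter -> Manager.
AspiredVersions = Dict[str, List[int]]  # servable name -> aspired version numbers


class FileSystemSource:
    """Discovers versions laid out as <base>/<name>/<version>/ and pushes them downstream."""

    def __init__(self, downstream: Callable[[AspiredVersions], None]) -> None:
        self.downstream = downstream

    def poll(self, discovered: AspiredVersions) -> None:
        # A real source would list storage; here the discovery result is injected.
        self.downstream({name: sorted(vs) for name, vs in discovered.items()})


class SourceAdapter:
    """Wraps raw version numbers into platform-specific "load" closures."""

    def __init__(self, make_loader: Callable[[str, int], Callable[[], object]],
                 downstream: Callable[[Dict[str, Dict[int, Callable[[], object]]]], None]) -> None:
        self.make_loader = make_loader
        self.downstream = downstream

    def __call__(self, aspired: AspiredVersions) -> None:
        self.downstream({name: {v: self.make_loader(name, v) for v in versions}
                         for name, versions in aspired.items()})


class Manager:
    """Loads newly aspired versions and unloads versions that are no longer aspired."""

    def __init__(self) -> None:
        self.loaded: Dict[str, Dict[int, object]] = {}

    def __call__(self, aspired: Dict[str, Dict[int, Callable[[], object]]]) -> None:
        for name, loaders in aspired.items():
            current = self.loaded.setdefault(name, {})
            for version, load in loaders.items():
                if version not in current:   # simplest policy: load every aspired version
                    current[version] = load()
            for version in list(current):    # unload versions no longer aspired
                if version not in loaders:
                    del current[version]


# Wiring the chain; the loader factory is a stand-in for a real platform adapter.
manager = Manager()
adapter = SourceAdapter(lambda name, v: (lambda: f"{name} v{v} (loaded)"), manager)
source = FileSystemSource(adapter)
source.poll({"ranker": [1, 2]})
print(manager.loaded)
```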

A key feature of model lifecycle management is the Aspired Versions Manager (AVM), which governs model version transition policies, balancing availability against resource constraints: it can keep an old version serving until its replacement is ready, or minimize peak resource usage during the swap. Noteworthy are the system's optimization techniques, including read-copy-update data structures and thread isolation, which keep the request path at low latency and high throughput.
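
As an illustration of the read-copy-update idea, the sketch below installs a fresh immutable snapshot on each version transition so that request threads can look up servables without taking a lock. Names are hypothetical; in CPython the final reference swap is atomic, standing in for the atomic pointer swap a C++ implementation would use:

```python
import threading
from typing import Dict, Optional


class ServableHandleTable:
    """Read-copy-update sketch: readers see an immutable snapshot without locking;
    a manager thread installs a new snapshot whenever versions change."""

    def __init__(self) -> None:
        self._snapshot: Dict[str, object] = {}  # servable name -> currently served model
        self._update_lock = threading.Lock()    # serializes writers only

    def lookup(self, name: str) -> Optional[object]:
        # Readers take no lock: they see whichever snapshot reference is current.
        return self._snapshot.get(name)

    def publish(self, name: str, model: object) -> None:
        with self._update_lock:
            new_snapshot = dict(self._snapshot)  # copy
            new_snapshot[name] = model           # update
            self._snapshot = new_snapshot        # atomic reference swap
```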

Inference in TensorFlow-Serving is exposed through both low-level and high-level API interfaces. It incorporates efficient mechanisms for inter-request batching, which let hardware accelerators such as GPUs and TPUs process many requests per invocation. Batching amortizes per-request overhead and substantially improves the serving system's throughput.
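
A simplified, hypothetical sketch of inter-request batching: requests arriving from many client threads are queued, and a background thread runs the model once per batch, which is how the accelerator's per-invocation overhead gets amortized:

```python
import queue
import threading
from typing import Callable, List, Tuple


class RequestBatcher:
    """Collects requests from many threads and runs one model call per batch."""

    def __init__(self, run_model: Callable[[List[object]], List[object]],
                 max_batch_size: int = 32, timeout_s: float = 0.005):
        self.run_model = run_model
        self.max_batch_size = max_batch_size
        self.timeout_s = timeout_s
        self.pending: "queue.Queue[Tuple[object, queue.Queue]]" = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, example: object) -> object:
        reply: "queue.Queue" = queue.Queue(maxsize=1)
        self.pending.put((example, reply))
        return reply.get()                   # blocks until the batch has been run

    def _loop(self) -> None:
        while True:
            batch = [self.pending.get()]     # wait for at least one request
            try:
                while len(batch) < self.max_batch_size:
                    batch.append(self.pending.get(timeout=self.timeout_s))
            except queue.Empty:
                pass                         # timed out: run a partial batch
            inputs = [example for example, _ in batch]
            for (_, reply), output in zip(batch, self.run_model(inputs)):
                reply.put(output)


# Usage: the "model" here is a stand-in that scores its inputs.
batcher = RequestBatcher(run_model=lambda xs: [f"scored:{x}" for x in xs])
print(batcher.predict("example-1"))
```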

Canonical Binary and Hosted Service

Beyond the software library, TensorFlow-Serving supports model deployment through a canonical server binary and a hosted service known as TFS². While the binary covers typical use cases with a standard setup, TFS² raises the level of abstraction and manages deployment tasks automatically, including model version monitoring, canary testing, and rollback. Its orchestration state is kept in Google's globally replicated database system, Spanner, which helps ensure robust service availability.
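
For the canonical binary, a minimal client sketch against the open-source model server's gRPC Predict API might look like the following; the server address, model name, and input tensor key are placeholders, and the server is assumed to be a stock tensorflow_model_server listening on its default gRPC port:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Placeholder address; the open-source tensorflow_model_server serves gRPC on 8500 by default.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"               # placeholder model name
request.model_spec.signature_name = "serving_default"
request.inputs["x"].CopyFrom(                      # "x" is a placeholder input key
    tf.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32))

response = stub.Predict(request, timeout=5.0)
print(response.outputs)
```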

Performance Evaluation and Implications

Upon analysis, TensorFlow-Serving demonstrates strong throughput and latency characteristics. The authors report operational benchmarks indicating substantial scale, underscored by TensorFlow-Serving's ability to handle high request rates in production environments within Google, with substantial usage across various business units.

From a practical perspective, TensorFlow-Serving's design paradigm could inform the construction of ML model serving systems across different organizational contexts. By separating lifecycle management from inference components and utilizing modular APIs, the architecture facilitates extensibility and adaptability to evolving use-cases. Additionally, the batching techniques and resource management principles could be transferred and adapted to other infrastructure projects needing high-volume, low-latency model serving capabilities.

Future Speculations

Looking forward, the authors' work suggests several potential research trajectories and applications. The abstraction principles and modular approaches could inspire advancements beyond TensorFlow and Google's ecosystem, potentially influencing cloud service providers and enterprise ML applications to adopt similar serving frameworks. Furthermore, exploring novel batching strategies and extending TFS² could further push the boundaries of efficient ML model serving, particularly in distributed and latency-sensitive environments.

Overall, TensorFlow-Serving offers a comprehensive and powerful model-serving solution, equipped to handle the complexities and demands of large-scale ML applications. While centered around Google's internal and external requirements, the principles observed here have clear implications for the broader ML landscape, setting a path towards more efficient, flexible, and scalable ML server architectures.