Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision (2403.12981v1)

Published 2 Mar 2024 in cs.DC, cs.AI, cs.CV, and cs.LG

Abstract: Deep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end DNN-based vision application contains more than just DNN inference, including input decompression, resizing, sampling, normalization, and data transfer. In this paper, we perform a thorough evaluation of computer vision inference requests performed on a throughput-optimized serving system. We quantify the performance impact of server overheads such as data movement, preprocessing, and message brokers between two DNNs producing outputs at different rates. Our empirical analysis encompasses many computer vision tasks including image classification, segmentation, detection, depth-estimation, and more complex processing pipelines with multiple DNNs. Our results consistently demonstrate that end-to-end application performance can easily be dominated by data processing and data movement functions (up to 56% of end-to-end latency in a medium-sized image, and ~80% impact on system throughput in a large image), even though these functions have been conventionally overlooked in deep learning system design. Our work identifies important performance bottlenecks in different application scenarios, achieves 2.25× better throughput compared to prior work, and paves the way for more holistic deep learning system design.

References (19)
  1. 2023. https://blog.youtube/news-and-events/using-technology-more-consistently-apply-age-restrictions/
  2. 2023a. https://developer.nvidia.com/blog/leveraging-hardware-jpeg-decoder-and-nvjpeg-on-a100/
  3. 2023. AI Matrix. https://aimatrix.ai/en-us
  4. 2023. Business Insider: Facebook Users Are Uploading 350 Million New Photos Each Day. https://www.businessinsider.com/facebook-350-million-photos-each-day-2013-9
  5. 2023. ChatGPT sets record for fastest-growing user base - analyst note. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
  6. 2023b. NVIDIA Data Loading Library (DALI). https://developer.nvidia.com/dali
  7. Mohamed S Abdelfattah et al. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In 2018 28th international conference on field programmable logic and applications (FPL). IEEE, 411–4117.
  8. Robert Adolf et al. 2016. Fathom: Reference workloads for modern deep learning methods. In 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 1–10.
  9. Gene M Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference. 483–485.
  10. Wesley Brewer et al. 2020a. iBench: a distributed inference simulation and benchmark suite. In 2020 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–6.
  11. Wesley Brewer et al. 2020b. Inference benchmarking on HPC systems. In 2020 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–9.
  12. Cody Coleman et al. 2017. Dawnbench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
  13. Alexey Dosovitskiy et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  14. Amin Firoozshahian et al. 2023. MTIA: First Generation Silicon Targeting Meta’s Recommendation Systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–13.
  15. Norman P. Jouppi et al. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. arXiv:2304.01433 [cs.AR]
  16. Vijay Janapa Reddi et al. 2020. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 446–459.
  17. Daniel Richins et al. 2021. AI tax: The hidden cost of AI data center applications. ACM Transactions on Computer Systems (TOCS) 37, 1-4 (2021), 1–32.
  18. Huaizheng Zhang et al. 2020. Inferbench: Understanding deep learning inference serving with an automatic benchmarking system. arXiv preprint arXiv:2011.02327 (2020).
  19. Hongbin Zheng et al. 2020. Optimizing memory-access patterns for deep learning accelerators. arXiv preprint arXiv:2002.12798 (2020).

Summary

  • The paper shows that non-inference work, such as preprocessing and data transfers, accounts for up to 56% of end-to-end latency for medium-sized images and reduces system throughput by roughly 80% for large images.
  • The study empirically evaluates a range of computer vision models, including dual-DNN pipelines, to quantify the impact of commonly overlooked server overheads.
  • The analysis advocates balanced system architectures and dynamic batching strategies that optimize preprocessing alongside inference.

Performance Analysis of DNN Server Overheads in Computer Vision Applications

The paper, "Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision," offers a comprehensive examination of the overlooked components in deep neural network (DNN) deployment, specifically in the context of computer vision tasks. The critical focus is not only on the inference performance of models but also on the complete lifecycle of requests in a serving system, which includes substantial preprocessing and data movement tasks that can significantly influence the end-to-end performance.

A core finding is that while advances in GPUs and TPUs have improved DNN inference times, these gains do not necessarily translate into overall system efficiency in real-world deployments. Evaluating a range of computer vision workloads, from image classification to complex multi-DNN pipelines, the authors identify non-inference tasks such as input decompression, resizing, and data transfer as major contributors to latency and throughput degradation. These overheads can account for up to 56% of end-to-end latency for medium-sized images and reduce system throughput by approximately 80% for large images.

The empirical results dissect the latency distribution in optimized server configurations, showing that preprocessing can dominate latency, particularly for larger inputs. For instance, using NVIDIA's DALI library to offload preprocessing to GPUs is shown to increase throughput and reduce energy consumption, underscoring preprocessing's role in holistic system optimization. Moreover, the latency breakdown under various loads reveals significant queuing overheads, making a strong case for dynamic batching and careful tuning of server parameters such as concurrency levels and batch sizes.
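To make the offload concrete, the following is a minimal sketch of a GPU-accelerated preprocessing pipeline of the kind DALI supports: JPEG decode, resize, and normalization all run on the device before inference. The image directory, 224×224 target size, and ImageNet-style normalization constants are illustrative assumptions, not settings taken from the paper.

```python
# Sketch of a DALI pipeline that moves decode/resize/normalize onto the GPU.
# The path, resolution, and normalization constants are assumptions.
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def preprocess_pipeline(image_dir):
    jpegs, labels = fn.readers.file(file_root=image_dir, name="reader")
    # "mixed" starts decoding on the CPU and finishes on the GPU, using the
    # hardware JPEG decoder where available (e.g., on A100-class devices).
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = preprocess_pipeline(image_dir="/data/images")  # hypothetical path
pipe.build()
images_gpu, labels = pipe.run()  # tensors already reside on the GPU for inference
```

Keeping the decoded tensors on the GPU avoids an extra host-to-device copy before inference, which is precisely the kind of data-movement cost the paper measures.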

A particularly notable contribution is the exploration of a dual-DNN pipeline that performs face detection followed by face identification. When the intermediate data flow is managed by a message broker, the authors show that Redis, operating in memory, reduces latency substantially compared with disk-backed Apache Kafka, yielding a 2.25× throughput improvement over prior work built on Kafka.
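The broker pattern can be illustrated with a short sketch that uses a Redis list as the in-memory queue between the two stages. The queue name and the detect_faces/identify_face callables are hypothetical stand-ins for the two DNNs, not code from the paper.

```python
# Minimal producer/consumer sketch of a dual-DNN pipeline with Redis as the
# in-memory broker. detect_faces and identify_face are hypothetical DNN calls.
import pickle

import numpy as np
import redis

broker = redis.Redis(host="localhost", port=6379)  # assumed local Redis server
QUEUE = "detected_faces"  # hypothetical queue name

def detection_stage(frame: np.ndarray, detect_faces) -> None:
    # Stage 1: run face detection and push each cropped face to the queue.
    for crop in detect_faces(frame):
        broker.lpush(QUEUE, pickle.dumps(crop))

def identification_stage(identify_face) -> None:
    # Stage 2: block until a crop arrives, then run face identification on it.
    while True:
        _, payload = broker.brpop(QUEUE)
        crop = pickle.loads(payload)
        identity = identify_face(crop)
        print(identity)
```

Because the queue lives entirely in memory, handing a crop from one stage to the next avoids the disk persistence of a default Kafka deployment, which is where the latency comparison attributes the gap.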

The paper also casts light on vital hardware implications, particularly concerning multi-GPU environments. The performance evaluations underline that while increasing the number of GPUs enhances system throughput when inference is the bottleneck, the benefits plateau if preprocessing remains the limiting factor. This underscores the necessity for balanced system architectures that address both computing and data-processing tasks collectively.
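A simple bottleneck model, with made-up rates, illustrates why GPU scaling plateaus once preprocessing becomes the limiting stage:

```python
def pipeline_throughput(num_gpus: int, gpu_infer_rps: float, preprocess_rps: float) -> float:
    # Steady-state throughput of a two-stage pipeline is set by its slowest stage.
    return min(preprocess_rps, num_gpus * gpu_infer_rps)

# Hypothetical rates: each GPU sustains 400 inferences/s, CPU preprocessing
# saturates at 900 images/s. Throughput stops improving after the third GPU.
for n in range(1, 6):
    print(n, pipeline_throughput(n, gpu_infer_rps=400, preprocess_rps=900))
# 1 -> 400, 2 -> 800, 3 -> 900, 4 -> 900, 5 -> 900
```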

From a theoretical standpoint, the findings align with Amdahl's Law, reinforcing that optimizing a singular component, such as inference, yields diminishing returns unless associated subsystems are concurrently enhanced. Practically, these insights advocate for the redesign of serving architectures to encompass efficient data preprocessing capabilities, potentially necessitating hardware innovations or leveraging software suites like TensorRT for model optimizations.
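As a worked instance of that argument, plugging the paper's 56% overhead figure for a medium-sized image into Amdahl's Law bounds the end-to-end gain from accelerating inference alone; the 10× factor below is an arbitrary example.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    # Overall speedup when only `accelerated_fraction` of the work is sped up
    # by `factor`; the remaining fraction runs at its original speed.
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# With 56% of latency spent outside inference, only 44% of the request benefits
# from a faster accelerator, so the end-to-end speedup is capped near 1/0.56.
print(amdahl_speedup(0.44, 10))    # ~1.66x from a 10x faster accelerator
print(amdahl_speedup(0.44, 1e9))   # ~1.79x even with effectively infinite speedup
```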

Looking forward, the paper suggests directions for DNN infrastructure, pointing to alternative hardware such as TPUs or custom ASICs to further accelerate preprocessing and data movement. It also opens avenues for deployment strategies that integrate with existing hardware setups, paving the way for comprehensive end-to-end optimization of AI workloads. This shift from a purely model-centric viewpoint to a systems-level perspective carries significant implications for future AI infrastructure.
