Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision (2403.12981v1)

Published 2 Mar 2024 in cs.DC, cs.AI, cs.CV, and cs.LG

Abstract: Deep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end DNN-based vision application contains more than just DNN inference, including input decompression, resizing, sampling, normalization, and data transfer. In this paper, we perform a thorough evaluation of computer vision inference requests performed on a throughput-optimized serving system. We quantify the performance impact of server overheads such as data movement, preprocessing, and message brokers between two DNNs producing outputs at different rates. Our empirical analysis encompasses many computer vision tasks including image classification, segmentation, detection, depth-estimation, and more complex processing pipelines with multiple DNNs. Our results consistently demonstrate that end-to-end application performance can easily be dominated by data processing and data movement functions (up to 56% of end-to-end latency in a medium-sized image, and ~80% impact on system throughput in a large image), even though these functions have been conventionally overlooked in deep learning system design. Our work identifies important performance bottlenecks in different application scenarios, achieves 2.25× better throughput compared to prior work, and paves the way for more holistic deep learning system design.

References (19)
  1. 2023. https://blog.youtube/news-and-events/using-technology-more-consistently-apply-age-restrictions/
  2. 2023a. https://developer.nvidia.com/blog/leveraging-hardware-jpeg-decoder-and-nvjpeg-on-a100/
  3. 2023. AI Matrix. https://aimatrix.ai/en-us
  4. 2023. Business Insider: Facebook Users Are Uploading 350 Million New Photos Each Day. https://www.businessinsider.com/facebook-350-million-photos-each-day-2013-9
  5. 2023. ChatGPT sets record for fastest-growing user base - analyst note. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
  6. 2023b. NVIDIA Data Loading Library (DALI). https://developer.nvidia.com/dali
  7. Mohamed S Abdelfattah et al. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In 2018 28th international conference on field programmable logic and applications (FPL). IEEE, 411–4117.
  8. Robert Adolf et al. 2016. Fathom: Reference workloads for modern deep learning methods. In 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 1–10.
  9. Gene M Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference. 483–485.
  10. Wesley Brewer et al. 2020a. iBench: a distributed inference simulation and benchmark suite. In 2020 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–6.
  11. Wesley Brewer et al. 2020b. Inference benchmarking on HPC systems. In 2020 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–9.
  12. Cody Coleman et al. 2017. Dawnbench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.
  13. Alexey Dosovitskiy et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  14. Amin Firoozshahian et al. 2023. MTIA: First Generation Silicon Targeting Meta’s Recommendation Systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–13.
  15. Norman P. Jouppi et al. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. arXiv:2304.01433 [cs.AR]
  16. Vijay Janapa Reddi et al. 2020. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 446–459.
  17. Daniel Richins et al. 2021. AI tax: The hidden cost of AI data center applications. ACM Transactions on Computer Systems (TOCS) 37, 1-4 (2021), 1–32.
  18. Huaizheng Zhang et al. 2020. Inferbench: Understanding deep learning inference serving with an automatic benchmarking system. arXiv preprint arXiv:2011.02327 (2020).
  19. Hongbin Zheng et al. 2020. Optimizing memory-access patterns for deep learning accelerators. arXiv preprint arXiv:2002.12798 (2020).

Summary

  • The paper shows that non-inference work, such as preprocessing and data transfers, accounts for up to 56% of end-to-end latency for medium-sized images and reduces system throughput by roughly 80% for large images.
  • The study empirically evaluates a range of computer vision models, including dual-DNN pipelines, to quantify the impact of commonly overlooked server overheads.
  • The analysis advocates balanced system architectures and dynamic batching strategies that optimize preprocessing alongside inference.

Performance Analysis of DNN Server Overheads in Computer Vision Applications

The paper, "Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision," offers a comprehensive examination of the overlooked components in deep neural network (DNN) deployment, specifically in the context of computer vision tasks. The critical focus is not only on the inference performance of models but also on the complete lifecycle of requests in a serving system, which includes substantial preprocessing and data movement tasks that can significantly influence the end-to-end performance.

A core finding is that while advances in GPUs and TPUs have improved DNN inference times, these gains do not necessarily translate into overall system efficiency in real-world deployments. Evaluating a range of computer vision workloads, from image classification to complex multi-DNN pipelines, the authors identify non-inference tasks such as input decompression, resizing, and data transfer as major contributors to latency and throughput degradation. These overheads can account for up to 56% of end-to-end latency for medium-sized images and reduce system throughput by approximately 80% for large images.

The empirical results dissect the latency distribution in optimized server configurations, showing that preprocessing can dominate latency, particularly for larger inputs. For instance, using NVIDIA's DALI library to offload preprocessing to GPUs is shown to increase throughput and reduce energy consumption, underscoring preprocessing's role in holistic system optimization. Moreover, the latency breakdown under various loads reveals significant queuing overheads, making a strong case for dynamic batching and careful tuning of server parameters such as concurrency levels and batch sizes.
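To make the offload concrete, the following is a minimal sketch of a GPU-accelerated preprocessing pipeline of the kind DALI supports: JPEG decode, resize, and normalization all run on the device before inference. The image directory, 224×224 target size, and ImageNet-style normalization constants are illustrative assumptions, not settings taken from the paper.

```python
# Sketch of a DALI pipeline that moves decode/resize/normalize onto the GPU.
# The path, resolution, and normalization constants are assumptions.
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def preprocess_pipeline(image_dir):
    jpegs, labels = fn.readers.file(file_root=image_dir, name="reader")
    # "mixed" starts decoding on the CPU and finishes on the GPU, using the
    # hardware JPEG decoder where available (e.g., on A100-class devices).
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = preprocess_pipeline(image_dir="/data/images")  # hypothetical path
pipe.build()
images_gpu, labels = pipe.run()  # tensors already reside on the GPU for inference
```

Keeping the decoded tensors on the GPU avoids an extra host-to-device copy before inference, which is precisely the kind of data-movement cost the paper measures.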

A particularly notable contribution is the exploration of a dual-DNN pipeline that performs face detection followed by face identification. When the intermediate data flow is managed by a message broker, the authors show that Redis, operating in memory, reduces latency substantially compared with disk-backed Apache Kafka, yielding a 2.25× throughput improvement over prior work built on Kafka.
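The broker pattern can be illustrated with a short sketch that uses a Redis list as the in-memory queue between the two stages. The queue name and the detect_faces/identify_face callables are hypothetical stand-ins for the two DNNs, not code from the paper.

```python
# Minimal producer/consumer sketch of a dual-DNN pipeline with Redis as the
# in-memory broker. detect_faces and identify_face are hypothetical DNN calls.
import pickle

import numpy as np
import redis

broker = redis.Redis(host="localhost", port=6379)  # assumed local Redis server
QUEUE = "detected_faces"  # hypothetical queue name

def detection_stage(frame: np.ndarray, detect_faces) -> None:
    # Stage 1: run face detection and push each cropped face to the queue.
    for crop in detect_faces(frame):
        broker.lpush(QUEUE, pickle.dumps(crop))

def identification_stage(identify_face) -> None:
    # Stage 2: block until a crop arrives, then run face identification on it.
    while True:
        _, payload = broker.brpop(QUEUE)
        crop = pickle.loads(payload)
        identity = identify_face(crop)
        print(identity)
```

Because the queue lives entirely in memory, handing a crop from one stage to the next avoids the disk persistence of a default Kafka deployment, which is where the latency comparison attributes the gap.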

The paper also casts light on vital hardware implications, particularly concerning multi-GPU environments. The performance evaluations underline that while increasing the number of GPUs enhances system throughput when inference is the bottleneck, the benefits plateau if preprocessing remains the limiting factor. This underscores the necessity for balanced system architectures that address both computing and data-processing tasks collectively.
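A simple bottleneck model, with made-up rates, illustrates why GPU scaling plateaus once preprocessing becomes the limiting stage:

```python
def pipeline_throughput(num_gpus: int, gpu_infer_rps: float, preprocess_rps: float) -> float:
    # Steady-state throughput of a two-stage pipeline is set by its slowest stage.
    return min(preprocess_rps, num_gpus * gpu_infer_rps)

# Hypothetical rates: each GPU sustains 400 inferences/s, CPU preprocessing
# saturates at 900 images/s. Throughput stops improving after the third GPU.
for n in range(1, 6):
    print(n, pipeline_throughput(n, gpu_infer_rps=400, preprocess_rps=900))
# 1 -> 400, 2 -> 800, 3 -> 900, 4 -> 900, 5 -> 900
```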

From a theoretical standpoint, the findings align with Amdahl's Law, reinforcing that optimizing a singular component, such as inference, yields diminishing returns unless associated subsystems are concurrently enhanced. Practically, these insights advocate for the redesign of serving architectures to encompass efficient data preprocessing capabilities, potentially necessitating hardware innovations or leveraging software suites like TensorRT for model optimizations.
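As a worked instance of that argument, plugging the paper's 56% overhead figure for a medium-sized image into Amdahl's Law bounds the end-to-end gain from accelerating inference alone; the 10× factor below is an arbitrary example.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    # Overall speedup when only `accelerated_fraction` of the work is sped up
    # by `factor`; the remaining fraction runs at its original speed.
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# With 56% of latency spent outside inference, only 44% of the request benefits
# from a faster accelerator, so the end-to-end speedup is capped near 1/0.56.
print(amdahl_speedup(0.44, 10))    # ~1.66x from a 10x faster accelerator
print(amdahl_speedup(0.44, 1e9))   # ~1.79x even with effectively infinite speedup
```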

Looking forward, the paper suggests directions for DNN infrastructure, pointing to alternative hardware such as TPUs or custom ASICs to further accelerate preprocessing and data movement. It also opens avenues for deployment strategies that integrate with existing hardware setups, paving the way for comprehensive end-to-end optimization of AI workloads. This shift from a purely model-centric viewpoint to a systems-level perspective carries significant implications for future AI infrastructure.
