
Debunking the CUDA Myth Towards GPU-based AI Systems (2501.00210v2)

Published 31 Dec 2024 in cs.DC, cs.AI, and cs.AR

Abstract: This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which are currently the de facto standard in AI system design. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess Gaudi NPU's programmability by discussing several software-level optimization strategies employed in implementing critical FBGEMM operators and vLLM, evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPU's dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA's robust software ecosystem.

Summary

  • The paper shows that Gaudi’s reconfigurable Matrix Multiplication Engine achieves higher GEMM throughput than NVIDIA’s A100, proving its strength in compute-intensive tasks.
  • The paper finds that Gaudi delivers competitive energy efficiency and memory performance in LLM serving while underperforming on irregular memory accesses in RecSys workloads.
  • The paper identifies programmability challenges, notably limited low-level control over its MMEs, and suggests that enhancing developer visibility could bolster its optimization potential.

An Evaluation of Intel's Gaudi NPU for AI Model Serving

The paper "Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving" provides a comprehensive evaluation of the Intel Gaudi Neural Processing Unit (NPU) as an alternative to NVIDIA's GPUs within the context of AI model serving. Given the widespread adoption of NVIDIA GPUs, largely due to their CUDA software ecosystem, the paper seeks to assess whether the Gaudi NPU, with its distinct hardware and software architecture, can challenge this dominance. Here, we summarize the technical findings and implications of the research.

Key Findings

Performance Analysis

The analysis covers three critical aspects of AI computation: primitive compute, memory, and communication operations. The Gaudi NPU's Matrix Multiplication Engine (MME) exhibits strong performance in general matrix multiplication (GEMM) tasks, achieving higher compute throughput and utilization than NVIDIA's A100 GPU. This performance is attributed to the design of Gaudi's MME, which can dynamically reconfigure its systolic array to suit different GEMM shapes. For non-GEMM operations handled by Gaudi's Tensor Processor Cores (TPCs), Gaudi-2 proves comparably efficient in utilization, though it falls short of the A100 in absolute performance owing to its lower vector math throughput.
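As a concrete illustration of this kind of measurement, below is a minimal GEMM-throughput microbenchmark sketch in PyTorch; it is not the paper's harness. The device string, data type, shapes, and iteration counts are assumptions: "cuda" targets an A100, while a Gaudi system would instead use the "hpu" device exposed by the Habana PyTorch plugin.

```python
import time
import torch

DEVICE = "cuda"  # assumption: "hpu" on Gaudi-2 with the Habana PyTorch plugin loaded

def sync() -> None:
    # Block until queued device work completes so wall-clock timing is valid.
    if DEVICE == "cuda":
        torch.cuda.synchronize()

def gemm_tflops(m: int, n: int, k: int, iters: int = 50) -> float:
    """Sustained BF16 throughput of an (m x k) @ (k x n) matmul."""
    a = torch.randn(m, k, device=DEVICE, dtype=torch.bfloat16)
    b = torch.randn(k, n, device=DEVICE, dtype=torch.bfloat16)
    for _ in range(5):          # warm-up: exclude compilation and caching
        torch.matmul(a, b)
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    sync()
    elapsed = time.perf_counter() - start
    return 2.0 * m * n * k * iters / elapsed / 1e12  # 2*M*N*K FLOPs per GEMM

# Sweeping shapes probes how well each accelerator adapts its datapath:
for shape in [(4096, 4096, 4096), (8192, 1024, 8192), (512, 512, 16384)]:
    print(shape, f"{gemm_tflops(*shape):.1f} TFLOP/s")
```

Sweeping non-square shapes matters here: it is precisely on skewed GEMMs that the MME's ability to reconfigure its systolic array can pay off.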

In memory-bound operations, Gaudi-2 shows competitive performance on regular data-access patterns but is less efficient on irregular accesses, such as vector gather-scatter operations, once the access granularity falls below 256 bytes. Collective communication on Gaudi-2 delivers strong bandwidth when all devices in the system participate but degrades with fewer participating devices, whereas NVIDIA sustains consistent performance across scales thanks to its NVSwitch-based architecture.
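The sub-256-byte effect can be probed with a gather-bandwidth sweep like the sketch below. Sizes and names are illustrative assumptions, not the paper's configuration.

```python
import time
import torch

DEVICE = "cuda"  # assumption: "hpu" on Gaudi-2 with the Habana plugin

def gather_gbps(row_bytes: int, num_rows: int = 1 << 20, iters: int = 20) -> float:
    """Effective bandwidth of a random row gather at a given row granularity."""
    width = row_bytes // 4                        # fp32: 4 bytes per element
    table = torch.randn(num_rows, width, device=DEVICE)
    idx = torch.randint(num_rows, (num_rows,), device=DEVICE)
    torch.index_select(table, 0, idx)             # warm-up
    if DEVICE == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.index_select(table, 0, idx)
    if DEVICE == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return num_rows * row_bytes * iters / elapsed / 1e9  # gathered reads only

# The paper's crossover sits around 256 B: below that granularity,
# Gaudi-2's effective bandwidth falls off faster than the A100's.
for gran in (64, 128, 256, 512, 1024):
    print(f"{gran:5d} B/row -> {gather_gbps(gran):7.1f} GB/s")
```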

End-to-End Applications

The evaluation extends to end-to-end AI workloads, including recommendation systems (RecSys) and LLMs. Gaudi-2 demonstrates significant energy-efficiency benefits in LLM serving, attributable to its powerful MMEs and to LLM workloads being dominated by GEMM operations. In RecSys models, however, which are characterized by sparse, vector-gather-dependent computations, Gaudi-2 struggles to match the A100's performance, though it still achieves slightly better energy efficiency thanks to its lower power draw.
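To make the RecSys access pattern concrete, here is a small PyTorch sketch of the pooled embedding lookup that dominates such models (the operator family FBGEMM optimizes). Table, batch, and pooling sizes are illustrative assumptions, not the paper's configuration.

```python
import torch

DEVICE = "cuda"  # assumption: "hpu" on Gaudi-2 with the Habana plugin

# Illustrative sizes: a 1M-row table of 64-dim fp32 embeddings means each
# lookup touches a 256 B row, right at the granularity where the devices diverge.
NUM_EMBEDDINGS, DIM, BATCH, POOLING = 1_000_000, 64, 2048, 40

bag = torch.nn.EmbeddingBag(NUM_EMBEDDINGS, DIM, mode="sum").to(DEVICE)

# Each sample gathers POOLING random rows and sums them; this sparse,
# fine-grained gather is where Gaudi-2 trails the A100.
indices = torch.randint(NUM_EMBEDDINGS, (BATCH * POOLING,), device=DEVICE)
offsets = torch.arange(0, BATCH * POOLING, POOLING, device=DEVICE)

pooled = bag(indices, offsets)  # -> (BATCH, DIM) pooled embeddings
print(pooled.shape)
```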

Programmability Considerations

The research also explores Gaudi-2's programmability through case studies. While Gaudi offers a flexible single-program, multiple-data (SPMD) programming model via the TPC-C language for vector operations, the lack of low-level programmability for its MMEs presents challenges. The paper illustrates this through optimizations for embedding lookup operations in RecSys and for PagedAttention within the vLLM framework.

For RecSys, the paper shows that low-level optimizations written in TPC-C can close some of the performance gap, especially for larger batch sizes and wider vector formats, though inefficiencies remain for smaller data accesses. For LLMs served with vLLM, the Gaudi-optimized implementation benefits from Graph Compiler optimizations but still trails NVIDIA's CUDA-optimized counterpart, because low-level computations cannot be freely customized.
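To show where that restriction bites, here is a minimal, framework-free sketch of the indirection at the heart of PagedAttention: the KV cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to blocks. All names and sizes are assumptions for illustration; this shows the gather pattern, not vLLM's actual kernel.

```python
import torch

# Pool of physical KV blocks; sizes are illustrative assumptions.
NUM_BLOCKS, BLOCK, HEADS, HEAD_DIM = 1024, 16, 8, 64
kv_pool = torch.randn(NUM_BLOCKS, BLOCK, HEADS, HEAD_DIM)

def gather_keys(block_table: list[int], seq_len: int) -> torch.Tensor:
    """Reassemble one sequence's keys from scattered physical blocks."""
    blocks = kv_pool[torch.tensor(block_table)]            # (n_blocks, BLOCK, H, D)
    return blocks.reshape(-1, HEADS, HEAD_DIM)[:seq_len]   # (seq_len, H, D)

# A 40-token sequence stored across three non-contiguous blocks:
keys = gather_keys([7, 301, 42], seq_len=40)
print(keys.shape)  # torch.Size([40, 8, 64])
```

On GPUs this gather is fused into a hand-written CUDA kernel; on Gaudi, as the paper observes, the same indirection must be expressed through operations the Graph Compiler can schedule, which is where the remaining performance gap originates.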

Implications and Future Directions

The paper concludes that Intel's Gaudi NPUs show promise as an alternative to NVIDIA GPUs, particularly for energy efficiency and LLM workloads. However, NDA restrictions and the current software stack limit developer insight into the workings of Gaudi's MME and hinder optimization of memory-intensive operations, such as the vector gathers critical to RecSys. To bolster its competitiveness, the Gaudi ecosystem would benefit greatly from enhanced low-level programmability and detailed documentation of system internals.

In future studies, expanding the comparative analysis to include GPU vendors like AMD or evaluating Gaudi in the context of large-scale AI model training could provide more granular insights into Gaudi's positioning within the AI accelerator landscape. Additionally, exploring how future iterations of Gaudi address the identified challenges would be valuable for researchers and practitioners considering Gaudi for AI deployment.
