Acceleration of Compressed LLMs Using AMX-Powered CPUs
The paper "SPARAMX: Accelerating Compressed LLMs Token Generation on AMX-Powered CPUs" explores innovative methods for reducing the computational and latency constraints posed by LLMs by employing AMX-enabled CPUs and unstructured sparsity. This research provides an in-depth examination of these techniques, demonstrating up to a 1.42x reduction in end-to-end latency for LLM inference compared to existing PyTorch implementations, and introduces a set of customizable sparse kernels for efficiency improvements.
The authors situate the work against the rapid growth in LLM adoption, which has been accompanied by rising demand for high-performance computing resources, typically supplied by GPUs and TPUs. They propose exploiting the ubiquity and energy efficiency of CPUs to broaden access to AI, highlighting Intel's AMX capabilities in newer CPU generations such as Sapphire Rapids.
Contributions and Results
- Custom Sparse Kernels: The paper introduces open-source customized sparse kernels that replace linear layers in PyTorch models, leveraging unstructured sparsity to reduce memory operations during the memory-bound decode phase of LLM inference.
- Kernel Design: The researchers devised both dense and sparse kernels that exploit AMX's advanced matrix-multiplication features. A key innovation is a preprocessing step that stores weights in a compressed form, split into 'weight_metadata' and 'weight_values' arrays, reducing the memory-transfer overhead at run time (a minimal sketch of this layout follows the list below).
- Performance Metrics: Evaluations show speedups of 1.22x to 2.03x on specific linear modules of the Llama 3 8B model when using the custom kernels in place of the standard PyTorch implementation. Additionally, an INT8 kernel using AMX delivers up to 1.46x better performance than existing proprietary solutions for quantized models.
- Unstructured Sparsity in Attention Computation: For the first time, unstructured sparsity is applied to the attention computation, achieving a 1.14x speedup without notable accuracy loss. This was accomplished by modifying how the KV cache is handled during the decode stage (a second sketch after this list illustrates the idea).
- Comparison with Proprietary Software: In end-to-end evaluations, particularly those conducted at higher batch sizes, the AMX-powered implementations demonstrated superior throughput when compared to proprietary software such as DeepSparse.
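To make the weight-compression and layer-replacement ideas concrete, the sketch below shows one way they could look in PyTorch. The names SparseLinearSketch, compress_weights, and replace_linear_layers are illustrative, and the assumption that weight_metadata is a bitmask of nonzero positions paired with packed weight_values is inferred from the names rather than taken from the paper; the real SparAMX kernels consume a tiled, AMX-friendly layout directly instead of decompressing to a dense matrix as done here.

```python
import torch
import torch.nn as nn


def compress_weights(weight: torch.Tensor):
    """Split a dense weight matrix into a nonzero-position mask
    ('weight_metadata') and packed nonzero values ('weight_values').
    Illustrative only; the actual kernel layout is tiled for AMX."""
    mask = weight != 0                 # one flag per element (1 bit each in a real bitmask)
    values = weight[mask]              # packed nonzero values, row-major order
    return mask, values


class SparseLinearSketch(nn.Module):
    """Drop-in replacement for nn.Linear that stores compressed weights.
    This sketch decompresses before the matmul; the real kernel would feed
    the compressed form directly to AMX tiles to avoid the memory traffic."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.out_features = linear.out_features
        self.in_features = linear.in_features
        mask, values = compress_weights(linear.weight.data)
        self.register_buffer("weight_metadata", mask)
        self.register_buffer("weight_values", values)
        self.register_buffer(
            "bias", linear.bias.data.clone() if linear.bias is not None else None
        )

    def forward(self, x):
        # Scatter the packed nonzeros back into a dense matrix (sketch only).
        w = torch.zeros(
            self.out_features, self.in_features,
            dtype=self.weight_values.dtype, device=self.weight_values.device,
        )
        w[self.weight_metadata] = self.weight_values
        return nn.functional.linear(x, w, self.bias)


def replace_linear_layers(model: nn.Module):
    """Recursively swap nn.Linear modules for the sparse sketch."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, SparseLinearSketch(child))
        else:
            replace_linear_layers(child)
    return model
```

Storing only a nonzero mask plus the surviving values is what cuts memory traffic in the memory-bound decode phase: at roughly 50% unstructured sparsity, the bytes read per weight matrix drop by about half, plus a small metadata overhead.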
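The attention-side idea can be sketched similarly: prune small-magnitude entries of the cached keys and values, then run a standard decode-step attention over the sparsified cache. The magnitude-threshold criterion and the function names below are assumptions made for illustration; the paper's kernels exploit the resulting zeros inside the AMX computation rather than materializing a pruned dense cache as this sketch does.

```python
import torch


def sparsify_cache(cache: torch.Tensor, sparsity: float = 0.5):
    """Zero out the smallest-magnitude fraction of KV-cache entries.
    The pruning criterion (per-tensor magnitude threshold) is an assumption;
    it yields unstructured sparsity a sparse kernel could exploit."""
    k = int(sparsity * cache.numel())
    if k == 0:
        return cache
    threshold = cache.abs().flatten().kthvalue(k).values
    return torch.where(cache.abs() > threshold, cache, torch.zeros_like(cache))


def decode_step_attention(q, k_cache, v_cache, sparsity=0.5):
    """One decode-step attention over a sparsified KV cache.
    q: (heads, 1, d); k_cache, v_cache: (heads, seq, d)."""
    k_sp = sparsify_cache(k_cache, sparsity)
    v_sp = sparsify_cache(v_cache, sparsity)
    scores = q @ k_sp.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_sp
```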
Implications and Future Work
The demonstrated methodology has significant implications for making LLMs more accessible on energy-efficient platforms. Combining unstructured sparsity with AMX-supported matrix multiplication paves the way for meaningful reductions in computational cost and latency, which is particularly pertinent for edge devices and other environments where energy efficiency and cost-effectiveness are paramount.
The paper also opens avenues for further exploration, particularly in optimizing the sparsity implementation to improve performance in compute-bound scenarios, which the results indicate remains a challenge. Future work could focus on integrating these kernels into more general-purpose frameworks and extending support to additional data formats, including more aggressive quantization strategies such as INT4.
In conclusion, the SPARAMX kernels represent a noteworthy step toward democratizing access to high-powered LLM capabilities, offering a tangible solution to the current computational challenges faced within AI-driven fields. Such advancements foster the possibility of broader AI deployments across multiple industry sectors, making sophisticated LLMs more accessible and practical for real-world applications.