Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures (2504.09870v1)

Published 14 Apr 2025 in cs.AR, cs.LG, and cs.PL

Abstract: Irregular embedding lookups are a critical bottleneck in recommender models, sparse LLMs, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. We then propose the Ember compiler, which automatically generates optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations, each designed for a different optimization level. This allows Ember to implement all the optimizations needed to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
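To make the decoupling idea concrete, the following is a minimal sketch of a sum-reduced embedding lookup split into an access stage and an execute stage that communicate only through a queue. It is an illustration under stated assumptions, not Ember's generated code or any of its intermediate representations: the function names (`access_stage`, `execute_stage`, `embedding_bag_sum`), the indices-plus-offsets bag layout, and the use of a Python `deque` as the queue are all hypothetical choices made for this example. On real DAE hardware the two stages would run concurrently; here they run one after the other for simplicity.

```python
from collections import deque


def access_stage(table, indices, offsets, queue):
    """Access side (hypothetical): walk the irregular index stream and
    enqueue (bag_id, row) pairs. All indirect memory reads happen here."""
    bounds = list(offsets) + [len(indices)]
    for bag_id in range(len(offsets)):
        for pos in range(bounds[bag_id], bounds[bag_id + 1]):
            queue.append((bag_id, table[indices[pos]]))


def execute_stage(queue, num_bags, dim):
    """Execute side (hypothetical): consume rows in arrival order and
    sum-reduce them per bag. This stage never touches an index."""
    out = [[0.0] * dim for _ in range(num_bags)]
    while queue:
        bag_id, row = queue.popleft()
        for d in range(dim):
            out[bag_id][d] += row[d]
    return out


def embedding_bag_sum(table, indices, offsets):
    """Run both stages back to back through a shared queue."""
    queue = deque()
    access_stage(table, indices, offsets, queue)
    return execute_stage(queue, len(offsets), len(table[0]))


# Example: three embedding rows, two bags ([0, 2] and [1, 1, 2]).
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = embedding_bag_sum(table, indices=[0, 2, 1, 1, 2], offsets=[0, 2])
```

The point of the split is that the data-dependent addressing is confined to the access stage, while the execute stage sees only a regular stream of rows to reduce.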

Authors (11)
