
Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression (2505.01637v1)

Published 3 May 2025 in cs.PL and cs.LG

Abstract: High-throughput neural network inference requires coordinating many optimization decisions, including parallel tiling, microkernel selection, and data layout. The product of these decisions forms a search space of programs which is typically intractably large. Existing approaches (e.g., auto-schedulers) often address this problem by sampling this space heuristically. In contrast, we introduce a dynamic-programming-based approach to explore more of the search space by iteratively decomposing large program specifications into smaller specifications reachable from a set of rewrites, then composing a final program from each rewrite that minimizes an affine cost model. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $Z_{\geq 0}$ and compresses identical, adjacent solutions. This approach can visit a much larger set of programs than prior work. To evaluate the approach, we developed Morello, a compiler which lowers specifications roughly equivalent to a few-node XLA computation graph to x86. Notably, we found that an affine cost model is sufficient to surface high-throughput programs. For example, Morello synthesized a collection of matrix multiplication benchmarks targeting a Zen 1 CPU, including a 1x2048x16384, bfloat16-to-float32 vector-matrix multiply, which was integrated into Google's gemma.cpp.

Summary

Dynamic Programming and Spatial Compression: A New Angle on Neural Network Compilation

The paper "Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression" presents an innovative approach to optimizing high-throughput neural network inference by leveraging dynamic programming and spatial compression techniques. Authored by Samuel J. Kaufman, René Just, and Rastislav Bodik, the work tackles the challenge of managing the expansive search space of program optimizations crucial for efficient neural network execution.

Overview of Morello Compiler

At the heart of the paper is Morello, a compiler designed to synthesize fast neural network implementations. The compiler employs a dynamic programming methodology to navigate the intricate search space formed by optimization decisions over memory layout, parallel tiling, and microkernel selection. Unlike traditional auto-schedulers, which heuristically sample this space, Morello iteratively decomposes large program specifications into smaller sub-specifications via a set of rewrites, then composes a final program from the rewrites that minimize an affine cost model.
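The decompose-and-compose loop can be sketched as a memoized recursion. This is a deliberately simplified illustration, not Morello's actual API: here a "spec" is just a matmul shape `(m, k, n)`, the only rewrite is halving a dimension, and the cost function is invented.

```python
from functools import lru_cache

MICROKERNEL_LIMIT = 4  # specs at or below this size lower directly (assumed threshold)

def cost(spec, sub_costs):
    # Toy affine cost: a fixed per-rewrite overhead plus the children's costs.
    return 1 + sum(sub_costs)

def rewrites(spec):
    # Enumerate decompositions: halve each dimension that is still large.
    m, k, n = spec
    out = []
    if m > MICROKERNEL_LIMIT:
        out.append([(m // 2, k, n), (m - m // 2, k, n)])
    if n > MICROKERNEL_LIMIT:
        out.append([(m, k, n // 2), (m, k, n - n // 2)])
    return out

@lru_cache(maxsize=None)  # memoization table (stands in for Morello's spatial table)
def solve(spec):
    options = rewrites(spec)
    if not options:
        return 0  # base case: the spec is small enough for a microkernel
    # Compose the minimum-cost program over all candidate decompositions.
    return min(cost(spec, [solve(s) for s in subs]) for subs in options)

print(solve((8, 4, 8)))  # → 3: two levels of halving plus the root rewrite
```

Because sub-specifications recur across many decompositions, the memoized recursion visits each distinct spec once, which is what makes exhaustive exploration tractable.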

Technical Contributions

The authors introduce two main innovations: a recursive decomposition method allowing deeper exploration of the optimization space and a novel spatial data structure for memoization. This data structure systematically indexes program specifications by integer coordinates and compresses solutions by merging identical adjacent entries, significantly reducing memory demands.
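The compression idea can be illustrated with a simple run-length encoding over one axis of the coordinate grid. This is a minimal sketch of the principle (identical adjacent solutions stored once), not the paper's actual data structure; all names are invented.

```python
def compress_row(solutions):
    """Run-length encode a row of per-coordinate solutions: [(count, solution), ...]."""
    runs = []
    for s in solutions:
        if runs and runs[-1][1] == s:
            runs[-1][0] += 1  # extend the current run of identical solutions
        else:
            runs.append([1, s])  # start a new run
    return runs

def lookup(runs, index):
    """Recover the solution at a coordinate from the compressed row."""
    for length, s in runs:
        if index < length:
            return s
        index -= length
    raise IndexError(index)

# Eight adjacent specs, two distinct optimal solutions -> two stored entries.
row = ["kernelA"] * 5 + ["kernelB"] * 3
runs = compress_row(row)
print(runs)            # → [[5, 'kernelA'], [3, 'kernelB']]
print(lookup(runs, 6)) # → kernelB
```

When neighboring specifications share an optimal solution, as is common along size axes, memory use scales with the number of distinct runs rather than the number of coordinates.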

Cost Model and Optimizations

The paper highlights the use of a straightforward affine cost model, demonstrating its effectiveness in identifying high-throughput programs. On the x86 targets evaluated, the model proves sufficient to rank candidate implementations by throughput both cheaply and consistently.
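Ranking candidates with an affine model amounts to taking a fixed linear combination of measurable program features plus a constant. The feature names and weights below are invented for illustration and do not come from the paper.

```python
# Hypothetical affine cost model: cost = base + sum(weight_i * feature_i).
WEIGHTS = {"bytes_moved": 0.5, "kernel_calls": 10.0}
BASE = 2.0

def affine_cost(features):
    """Score a candidate implementation by its feature vector (lower is better)."""
    return BASE + sum(WEIGHTS[k] * v for k, v in features.items())

# Two made-up tiling candidates with different feature profiles.
candidates = {
    "tiled_16": {"bytes_moved": 4096, "kernel_calls": 64},
    "tiled_32": {"bytes_moved": 6144, "kernel_calls": 16},
}
best = min(candidates, key=lambda c: affine_cost(candidates[c]))
print(best)  # → tiled_16
```

An affine model is attractive in a dynamic-programming setting because a parent's cost decomposes as a sum over its children, so sub-solutions can be compared and cached independently.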

Practical Implications

Morello successfully synthesized various matrix multiplication benchmarks targeting Zen 1 CPUs, including a 1×2048×16384 bfloat16-to-float32 vector-matrix multiply, subsequently integrated into Google's gemma.cpp. This integration underscores Morello's capability to deliver optimized solutions in real-world applications.

Future Directions

The paper speculates on extending the Morello framework to support nonlinear computation graphs, thereby broadening the types of neural network structures it can optimize. Additionally, it hints at further adaptations, such as enhanced pipeline parallelism modeling and improved scheduling flexibility, to accommodate specific user requirements.

Conclusion

This research represents a substantial step in compiler design for neural networks, emphasizing dynamic programming as a viable route to tackling the vast optimization space more comprehensively. Morello’s approach, focusing on recursive decomposition and efficient memoization, offers valuable insights for the continued evolution of compiler technologies aimed at high-performance execution of deep neural networks.
