Optimizing 32-bit Division by Constants on 64-bit Architectures
This lightning talk explores a compiler optimization breakthrough that makes 32-bit unsigned division by constants significantly faster on modern 64-bit CPUs. By recognizing that legacy compiler code generation fails to leverage wider registers and multiply instructions available on 64-bit architectures, researchers developed a new algorithm that replaces inefficient three-shift sequences with a single multiply-high operation. The result: up to 2x speedup on real hardware and immediate adoption in LLVM and GCC, demonstrating how co-optimizing arithmetic transformations with modern architectural capabilities can unlock substantial performance gains.

Script
Division operations are expensive on modern CPUs, taking tens of clock cycles compared to just a few for multiplication. Compilers have long replaced division by constants with faster multiply-and-shift tricks, but there's a hidden inefficiency: code generators designed for 32-bit CPUs waste the power of today's 64-bit architectures.
The Granlund-Montgomery method replaces division with multiplication by a magic constant and right shifts. For most divisors, this constant fits in 32 bits. But for roughly 23% of cases, it requires 33 bits, and that's where current compilers stumble, emitting three separate shift stages to reconstruct the quotient using only 32-bit intermediates.
What if we could eliminate all those shifts with a single operation?
The key insight is mathematical: you can always reformulate the division to use one 64-bit multiplication where the rescaled magic constant fits in 64 bits, then extract the high half of the product. On AArch64, this becomes a single multiply-high instruction. On x86-64, it's one multiply with straightforward high-half extraction.
Benchmarks using divisors 7, 19, and 107 on Intel Xeon and Apple M4 chips show dramatic improvements. The Apple M4 nearly doubles performance, cutting the benchmark's running time from 6.7 seconds to 3.4 seconds. These aren't synthetic wins: the patches have been merged into LLVM and submitted to GCC.
This isn't just about three divisors in a microbenchmark. It affects every 32-bit division by constant requiring a 33-bit multiplier across all compiled code. Inner loops in numerics and cryptography benefit immediately, and the approach generalizes to wider integers and future architectures, proving that compiler optimizations must evolve with hardware capabilities.
Legacy code generation patterns can silently handicap modern hardware for decades. This work shows that revisiting classical compiler transformations with fresh eyes on today's architectures unlocks performance hiding in plain sight. Visit EmergentMind.com to learn more and create your own research videos.