Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets
Abstract: Granlund and Montgomery proposed an optimization method for unsigned integer division by constants [3]. Their method (called the GM method in this paper) was further improved in part by works such as [1] and [7], and is now adopted by major compilers including GCC, Clang, Microsoft Compiler, and Apple Clang. However, for example, for x/7, the generated code is designed for 32-bit CPUs and therefore does not fully exploit 64-bit capabilities. This paper proposes an optimization method for 32-bit unsigned division by constants targeting 64-bit CPUs. We implemented patches for LLVM/GCC and achieved speedups of 1.67x on Intel Xeon w9-3495X (Sapphire Rapids) and 1.98x on Apple M4 (Apple M-series SoC) in the microbenchmark described later. The LLVM patch has already been merged into llvm:main [6], demonstrating the practical applicability of the proposed method.
Explain it Like I'm 14
What is this paper about?
This paper shows a faster way for computers to divide 32‑bit numbers by a fixed number (like 7 or 19) when running on modern 64‑bit chips. Instead of using the slow “divide” instruction, the authors find a smarter sequence that mostly uses multiplication, which computers do much faster. They put this new method into real compilers so lots of programs can get quicker without changing their source code.
What problem are the authors trying to solve?
Dividing numbers on a CPU is slower than multiplying them. Compilers (the tools that turn your code into machine instructions) already use a classic trick: when you divide by a fixed number, they replace it with “multiply by a special constant” and then “shift right” to get the same answer. This trick is often called the GM method and uses a “magic number.”
However, current compilers still use a design that fits older 32‑bit CPUs for a tricky subset of divisors. On 64‑bit CPUs, that older design does extra steps it no longer needs. The authors ask: can we make constant division faster on 64‑bit chips by fully using 64‑bit features?
How do compilers usually speed up division by a fixed number?
Here’s the everyday idea:
- Imagine you want to compute x divided by 7.
- Instead of doing slow long division every time, you precompute a “magic number” that acts like the reciprocal of 7 in integer math.
- Then you do: multiply x by that magic number, and shift the result to the right by some amount. This gives the same quotient as x/7 for all 32‑bit x.
In real compilers, this comes from math that guarantees the multiply‑then‑shift gives exactly the same result as dividing by that constant.
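The steps above can be sketched in a few lines of C. This is our own illustration (not code from the paper), using the well-known magic constant for dividing by 3, which fits in 32 bits:

```c
#include <stdint.h>

/* Classic GM-style "multiply then shift" for x / 3 on 32-bit x.
   The magic constant c = floor(2^33 / 3) + 1 = 2863311531 fits in
   32 bits, and floor(x * c / 2^33) equals floor(x / 3) for every
   32-bit x. */
uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 2863311531u) >> 33);
}
```

This two-step shape — one widening multiply, one shift — is what compilers emit for most divisors.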
What’s the catch? For most divisors, the magic number fits in 32 bits, and the code is simple. For some divisors (like 7, 19, 107), the best magic number needs 33 bits. Older code stayed within 32‑bit limits by breaking the work into several steps and shifts, which adds extra instructions.
A few words you’ll see:
- 32‑bit number: can store 0 up to about 4 billion.
- 64‑bit number: can store much larger values (up to about 18 quintillion).
- Unsigned: only non‑negative numbers.
- Shift right: like dividing by a power of 2 very quickly.
- “Upper 64 bits” of a product: when you multiply two 64‑bit numbers, the exact result can be up to 128 bits long. The “upper half” is like the leftmost part of that long result.
What new approach do the authors propose?
They noticed that on a 64‑bit CPU you don’t need to force every intermediate value to fit in 32 bits. For the “33‑bit magic number” cases, they rearrange the math so that:
- Instead of forming a full 128‑bit result and doing a slow 128‑bit shift, they choose a slightly adjusted constant that fits in 64 bits.
- Then they do one 64‑bit multiply and simply take the upper 64 bits of the result. On many CPUs, getting the top half is a single, fast instruction.
Why this helps:
- On x86‑64 (Intel/AMD), the instruction that shifts a 128‑bit value is relatively slow, about as costly as a multiply.
- Their method avoids that slow 128‑bit shift. It keeps only one multiply and reads the high half of the product.
- On Apple’s ARM chips (like the M4), there’s a special instruction called “umulh” that directly gives the upper 64 bits of the multiplication, so you can do the whole operation with just that one instruction.
In short, for the hard cases where the old code used a multi‑step sequence, the new code uses a single multiply‑style instruction.
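The idea for the hard cases can be sketched like this (our own C illustration, not the paper's code, using the GCC/Clang __uint128_t extension; the GM parameters for dividing by 7 are c = 4908534053, a 33-bit constant, with shift a = 35, so the scaled constant is t = c·2^29):

```c
#include <stdint.h>

/* Sketch of the scaled-magic idea for x / 7 (a 33-bit-magic divisor).
   GM gives c = 4908534053 with shift a = 35; scaling it to
   t = c * 2^(64-35) keeps t within 64 bits, so the quotient becomes
   the upper half of a single 64x64-bit multiply. */
static const uint64_t T7 = 4908534053ULL << 29;  /* t = c << (64 - a) */

uint32_t div7(uint32_t x) {
    /* floor(x * t / 2^64) == floor(x * c / 2^35) == x / 7 */
    return (uint32_t)(((__uint128_t)x * T7) >> 64);
}
```

On AArch64 the body maps to essentially one umulh; on x86-64, one mul whose high half lands in a register.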
How did they test it?
They:
- Modified LLVM (the system underneath the Clang compiler) to use their new rule.
- Built simple benchmark programs that divide many 32‑bit numbers by fixed divisors where the old method struggled (like 7, 19, and 107).
- Generated machine code for two platforms: an Intel Xeon (x86‑64) and an Apple M4 (ARM64).
- Timed how long the programs ran before and after the change.
What did they find, and why does it matter?
Main results:
- On an Intel Xeon w9‑3495X (Sapphire Rapids), code using the new method ran about 1.67× faster.
- On an Apple M4, it ran about 1.98× faster.
- The loops in the generated code became smaller and simpler (fewer instructions).
Why it matters:
- Division by constants happens all over the place: in graphics, data processing, cryptography, and more.
- Faster building blocks in compilers make many programs faster automatically, without developers changing their code.
- Less work per division can also save energy and improve responsiveness.
The practical impact is clear: their LLVM change was accepted into the main project, and they also prepared a GCC patch that’s under review. That means these speedups can reach many languages and applications.
The simple takeaway
When dividing 32‑bit numbers by fixed values on 64‑bit CPUs, the authors found a way to replace a multi‑step “shift-heavy” sequence with a single multiply that uses the CPU’s 64‑bit strengths. This small change in how compilers generate code makes certain divisions nearly twice as fast on real machines—and that improvement can benefit a huge range of software automatically.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, phrased to guide follow-on research and engineering work:
- Scope limitation to 32-bit unsigned division: no treatment of signed division, 8/16/64-bit dividends, or wider cases (e.g., 64-bit dividends on 64-bit targets where “magic” may be 65+ bits and would require 128-bit multiplications).
- Unexplored extension to the 32-bit-magic case: assess whether the same high-multiply trick with a scaled constant (t = c·2^(64−a)) also outperforms the classic “mul+shift” for divisors where c fits in 32 bits, on both x86-64 and AArch64.
- Formal correctness proof gap: the identity q = umulh(x, t) with t = c·2^(64−a) is sketched but not proven rigorously for all parameters; provide a complete proof that for all relevant d (with 33-bit c) and x ∈ [0, 2^32−1], floor(x·c / 2^a) equals floor(x·t / 2^64).
- Boundary conditions on a: the paper asserts 33 ≤ a ≤ 63 given d ≤ M/2; provide a formal derivation and verify there are no rare d for which a = 64 would be required (or specify handling if it occurs).
- Completeness across all divisors with 33-bit c: performance and codegen were shown for d ∈ {7, 19, 107}; quantify the speedup distribution and any regressions across the full set of divisors that yield 33-bit c.
- Architecture coverage is narrow: evaluate on additional microarchitectures (e.g., Intel Ice/Tiger/Raptor Lake, Atom/Tremont/Gracemont, AMD Zen 2/3/4/5, Neoverse cores, older A57/A72, Apple M1/M2/M3, POWER, and RISC-V RV64 with mulh) to validate generality and identify exceptions.
- Cost-model and heuristic integration: define target-aware selection rules (latency/throughput, port pressure, register constraints, code size) to decide when to prefer the new transform versus existing GM sequences per architecture/CPU model and optimization level.
- x86 instruction choice and flags/ports: investigate using BMI2 MULX (to avoid RAX/RDX constraints and flags clobbering) versus MUL/IMUL, and quantify impacts on dependency chains, register pressure, and port utilization.
- Immediate encoding and code-size trade-offs: analyze when the 64-bit constant t cannot be encoded as an immediate (x86 needs a mov imm64; AArch64 needs MOVZ/MOVK sequences); measure code-size and performance impacts and propose hoisting/constant-pool strategies and size-aware heuristics (-Os/-Oz).
- Negative cases/regressions: identify workloads where multiply-high is slower than shifts (e.g., on cores with cheap double-shifts or expensive multiplies, high multiply-port pressure, or dependency chains sensitive to mul latency) and gate the transform accordingly.
- Throughput vs latency characterization: provide separate benchmarks for independent-iteration throughput and tight dependency chains to understand the latency sensitivity of the multiply-high approach vs shift sequences.
- Vector/SIMD generalization: study SIMD forms (e.g., x86 AVX2/AVX-512 pmuludq + shifts, ARM NEON/SVE umulh equivalents) for accelerating simultaneous divisions over vectors of u32, including lane-wise correctness and performance.
- Combined quotient and remainder: explore whether computing r = x − q·d (after q via multiply-high) is competitive vs alternative remainder methods (e.g., Lemire’s fast remainder) and when both q and r are needed.
- Interaction with hardware udiv: quantify performance relative to native hardware division on AArch64 and x86-64 across CPUs; clarify when replacing udiv with multiply-high remains beneficial, especially for single-use (non-loop) occurrences.
- Compiler pipeline integration details: ensure the transform is applied consistently across SelectionDAG, GlobalISel, and MIR; analyze interactions with vectorizers (SLP/Loop), LTO, and late peepholes that might either defeat or further optimize the pattern.
- PIC/relocation and constant materialization: evaluate impacts under position-independent code and different relocation models (constant in rodata vs literal pools), including linker relaxation opportunities or penalties.
- Energy and power: measure power/energy effects of replacing shifts with multiplies across architectures, especially in mobile SoCs where energy efficiency matters.
- Robust benchmarking methodology: supplement “time” measurements with hardware performance counters (cycles, uops, IPC, port usage, cache misses) and report statistical rigor (confidence intervals) and code-size deltas.
- Real-world workloads: validate end-to-end impact in representative applications (e.g., cryptographic codebases cited, compression, parsers) rather than microbenchmarks alone, and identify domains where the transform offers material wins.
- Toolchain parity and maturity: present results for the GCC patch (not just LLVM), and ensure test suites (e.g., LLVM/GCC regression and performance suites) cover this transform to prevent future regressions.
- Safety under unusual ABIs or sanitizers: verify behavior with sanitizers (UBSan/ASan), unusual calling conventions, and mixed-language bindings where register constraints of MUL/RDX:RAX could interact with ABI-specific rules.
- Distributional insight into 33-bit c cases: beyond reporting the 77%/23% split, characterize which d ranges produce 33-bit c, how a varies with d, and whether performance benefits correlate with a or t magnitude.
- Extension to other constant-division optimizations: explore synergy with reciprocal-multiply techniques for floating-point-like reciprocals, multi-precision divisions, or hybrid schemes that combine table-driven approximations with multiply-high for edge cases.
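As a small aid to the first correctness gaps above, the identity and its boundary behavior can at least be probed empirically. This is a sampled sanity check, not a proof — our own sketch using the GCC/Clang __uint128_t extension, with (d = 7, c = 4908534053, a = 35) and (d = 19, c = 7233629131, a = 37) as worked examples:

```c
#include <stdint.h>

/* Compare the reference GM form floor(x*c / 2^a) with the scaled form
   floor(x*t / 2^64), where t = c << (64 - a), and check both against
   x / d. Assumes 33 <= a <= 63 and that t fits in 64 bits (c < 2^a). */
static int probe(uint32_t d, uint64_t c, unsigned a) {
    uint64_t t = c << (64 - a);
    for (uint64_t k = 0; k * d <= UINT32_MAX; k += 9973) {
        /* x = k*d is a point where the true quotient jumps; test it and
           its predecessor (x - 1 wraps to UINT32_MAX when x == 0,
           which is itself a boundary worth testing). */
        uint32_t x = (uint32_t)(k * d);
        for (int delta = 0; delta < 2; delta++) {
            uint32_t y = x - (uint32_t)delta;
            uint64_t q_ref = (uint64_t)(((__uint128_t)y * c) >> a);
            uint64_t q_new = (uint64_t)(((__uint128_t)y * t) >> 64);
            if (q_ref != q_new || q_new != y / d) return 0;
        }
    }
    return 1;
}
```

An exhaustive sweep over all 2^32 inputs, or a mechanized proof, would be the stronger follow-up the gap list asks for.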
Practical Applications
Immediate Applications
The following applications can be deployed now, given that the optimization is already merged into LLVM and practical on modern x86-64 and AArch64 CPUs. Each item notes sectors, potential tools/workflows, and key assumptions/dependencies.
- Compiler toolchains (C/C++/Rust/Swift/Julia/Zig) speed-ups “for free”
- Sectors: Software, HPC, embedded/mobile, cloud
- What: Any code compiled with LLVM-based compilers (Clang, Rustc, Swiftc, Julia’s JIT, Zig, many MLIR/LLVM-based DSLs) benefits when performing 32-bit unsigned division by a compile-time constant that yields a 33-bit “magic” constant c in the GM method.
- Tools/workflows: Upgrade to LLVM versions that include the patch; enable standard -O2/-O3 optimizations; refresh CI/CD toolchains
- Assumptions/dependencies: 64-bit target; divisor is a compile-time constant; the GM-derived c is 33 bits (about 23% of divisors < 2^31); the optimization pass is enabled in the vendor’s LLVM-based compiler release
- Faster constant-division in performance-critical libraries
- Sectors: Systems software, compression, parsing/formatting, signal processing, game engines
- What: Tight loops that divide by constants (e.g., scaling/quantization factors, table indexing, format conversions such as parsing/printing by 10, 100, etc.) see reduced instruction count and lower latency; on AArch64 the pattern maps to a single umulh
- Tools/workflows: Rebuild with an updated LLVM-based toolchain; performance audits for hot loops; prefer compile-time constants (e.g., constexpr) where possible
- Assumptions/dependencies: The divisor is known at compile time and falls into the 33-bit c case; some libraries already hand-optimize, so ensure code doesn’t inhibit compiler pattern recognition
- JIT-compiled analytics and databases
- Sectors: Data analytics, databases, stream processing
- What: LLVM-based query engines (e.g., DuckDB extensions, Velox-based systems, custom LLVM JITs) that lower expressions like bucket = value / width, where width is constant at plan time
- Tools/workflows: Adopt patched LLVM in the JIT; ensure constant folding/promotion exposes constants before codegen
- Assumptions/dependencies: The divisor is a constant at JIT time; the JIT uses an LLVM back-end; the target is 64-bit
- Cryptography and post-quantum cryptography implementations
- Sectors: Security, finance, embedded/mobile
- What: Constant division that remains in pack/unpack/encoding steps, digit/base conversions, or certain parameter rescalings can speed up; the authors reference ML-KEM/ML-DSA ecosystems and provide related repos
- Tools/workflows: Rebuild libs with LLVM trunk or the first release carrying the patch; where needed, drop in a header-only “constdiv” routine that emits mulhi/umulh directly for portability
- Assumptions/dependencies: Many crypto reductions avoid division (Barrett/Montgomery), so benefits depend on code structure; divisors must be compile-time constants in hot paths
- LLVM-based numerical and ML compilers
- Sectors: AI/ML, scientific computing
- What: Code generators (TVM, XLA backends using LLVM, Halide, MLIR-based pipelines) often materialize constant scaling (e.g., quantized kernels); constant divisions can be lowered to mulhi
- Tools/workflows: Update LLVM in the compiler stack; ensure constant propagation occurs before lowering; add IR patterns that preserve constant-division forms
- Assumptions/dependencies: Scales are constants at compile/JIT time; the target CPU supports fast 64×64→128 multiplication with cheap access to the high half (x86-64/AArch64)
- Energy and battery efficiency improvements
- Sectors: Mobile/edge devices, data centers
- What: Fewer instructions and lower runtime for constant-division-heavy loops reduce energy per operation
- Tools/workflows: Recompile with updated LLVM; include this optimization in energy/perf regression dashboards; A/B test power under representative workloads
- Assumptions/dependencies: The workload contains nontrivial volumes of 32-bit constant divisions in the 33-bit c class; DVFS and scheduling effects may modulate realized savings
- Static optimization and binary tooling
- Sectors: Tooling, performance engineering
- What: LLVM opt/llc, BOLT, or custom MachineFunction passes can canonicalize older GM sequences into the new single-multiply form on 64-bit targets
- Tools/workflows: Integrate a post-link binary-optimizer pass; pattern-match the 3-shift GM sequence and replace it with the mulhi form
- Assumptions/dependencies: Reliability of pattern detection; target CPU and ABI constraints; full correctness across corner cases
- Developer guidance and code review checklists
- Sectors: Software engineering, education
- What: Update guidelines to (a) keep constant divisors as compile-time constants, (b) avoid manual sequences that block optimization, and (c) rely on the compiler to emit the optimal mulhi-based code
- Tools/workflows: Linters or Clang-Tidy checks to detect manual patterns; CI enforcing modern compilers
- Assumptions/dependencies: Teams can upgrade toolchains; codebases allow minor refactors to expose constants
- Interim portability via micro-libraries
- Sectors: Embedded, cross-platform systems
- What: Use a small, header-only helper that precomputes 2^(64−a) · c and applies mulhi to compute x / d for a u32 x and constant d, ensuring consistent performance across compilers
- Tools/workflows: Integrate “constdiv” (as cited by the authors) while waiting for GCC adoption or for vendor LLVM releases to roll out
- Assumptions/dependencies: Careful testing for all edge cases; ensure ABI constraints (e.g., inline asm vs. portable intrinsics) are respected
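A sketch of what such a micro-library helper might look like (our own construction following the paper's recipe rather than the cited constdiv code; it assumes the GCC/Clang __uint128_t extension and a divisor in the range 2 ≤ d < 2^32):

```c
#include <stdint.h>

typedef struct { uint64_t t; } cdiv_u32;  /* precomputed scaled magic */

/* Precompute t = (floor(2^a / d) + 1) << (64 - a) for the smallest a
   satisfying the standard GM sufficient condition
   d - (2^a mod d) <= 2^(a - 32), which guarantees
   floor(x * c / 2^a) == x / d for all 32-bit x. */
static cdiv_u32 cdiv_make(uint32_t d) {
    for (unsigned a = 32; a <= 63; a++) {
        __uint128_t pow2a = (__uint128_t)1 << a;
        uint64_t rem = (uint64_t)(pow2a % d);
        if ((uint64_t)d - rem <= ((uint64_t)1 << (a - 32))) {
            __uint128_t c = pow2a / d + 1;  /* the GM magic; c < 2^a */
            return (cdiv_u32){ (uint64_t)(c << (64 - a)) };
        }
    }
    return (cdiv_u32){ 0 };  /* unreachable for 2 <= d < 2^32 */
}

/* x / d as the upper 64 bits of one 64x64-bit multiply. */
static uint32_t cdiv_div(cdiv_u32 cd, uint32_t x) {
    return (uint32_t)(((__uint128_t)x * cd.t) >> 64);
}
```

At optimization levels like -O2, cdiv_div should compile down to a single widening multiply that keeps only the high half (umulh on AArch64).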
Long-Term Applications
These require additional research, engineering, or ecosystem adoption before broad deployment.
- Generalization to 64-bit dividends by constants on 64-bit targets
- Sectors: Software, HPC, databases, cryptography
- What: Extend the “scale-then-mulhi” idea to u64/constant divisions (where the “magic” may be 65 bits) to avoid 128-bit shifts in quotient computation
- Dependencies: New proofs/derivations for bounds; careful codegen on architectures with different mul/shift trade-offs; compiler integration and benchmarking across microarchitectures
- Vectorized/SIMD constant division
- Sectors: AI/ML, graphics, signal processing, databases
- What: Introduce lane-wise mulhi strategies in NEON/SVE and AVX2/AVX-512 for vectorized u32/constant divisions, benefiting columnar scans and image/audio kernels
- Dependencies: Instruction availability (e.g., per-lane high-half multiplies); vectorizer support in LLVM; ensuring the transform’s profitability model holds in the presence of lane permutations
- Unified quotient+remainder optimization
- Sectors: Compilers, systems libraries
- What: Combine this quotient method with state-of-the-art remainder algorithms (e.g., “Faster Remainder by Direct Computation”) to pick optimal sequences for / and % by constants
- Dependencies: Compiler heuristics that compose transformations, reduce register pressure, and respect microarchitectural latencies
- PGO-driven specialization for constant divisions
- Sectors: Software at scale, cloud services
- What: Use Profile-Guided Optimization to identify hot constant-division sites, ensure the constant form is preserved, and auto-select the best codegen variant per CPU family
- Dependencies: PGO infrastructure; IR stability to retain constants; CPU-specific cost models
- Domain-specific JIT adoption (quantization-heavy inference/graphics pipelines)
- Sectors: AI inference, AR/VR, robotics
- What: Embed codegen rules in DSLs (Halide/TVM/MLIR) to prefer mulhi lowering for compile-time scales (e.g., uniform quantization steps or fixed bin widths)
- Dependencies: Ensuring those scales are known at JIT time; cross-target testing; maintaining numerical equivalence (rounding vs. truncation behavior)
- Hardware/ISA feedback loop
- Sectors: Semiconductor, CPU design
- What: Encourage high-throughput, low-latency access to the high half of multiplies (e.g., a streamlined mulhi without RAX/RDX dependencies on x86) to further improve this and similar idioms
- Dependencies: ISA evolution cycles; microarchitectural design trade-offs; compiler support to detect and exploit new instructions
- Formal verification and correctness tooling
- Sectors: Safety-critical systems, academia
- What: Mechanize proofs for the scaled-magic approach, build compiler verification tests, and integrate into formal frameworks for arithmetic code transformations
- Dependencies: Proof frameworks (Coq/Isabelle), conformance suites, and cooperation with compiler vendors
- Policy and sustainability initiatives
- Sectors: Public sector IT, energy
- What: Incorporate modern compiler requirements (with optimizations like this) into procurement to reduce compute energy budgets at scale
- Dependencies: Coordination with OS vendors and distributions; certification baselines that include compiler versioning and benchmarks
Notes on feasibility and scope
- Trigger conditions: The optimization applies to 32-bit unsigned division by a compile-time constant where the GM “magic” constant c is 33 bits; other cases (power-of-two divisors, very large divisors, or 32-bit c) already have efficient code paths.
- CPU prerequisites: 64-bit targets with fast 64×64→128 multiplication and cheap access to the high 64 bits (e.g., x86-64 mul to RDX:RAX; AArch64 umulh).
- Expected impact: Microbenchmarks show 1.67× (Sapphire Rapids) to 1.98× (Apple M4) speedups for affected divisions; end-to-end gains depend on how frequently such divisions occur in hot paths.
- Rollout status: The LLVM change is merged to llvm:main; the GCC patch is under review, so timelines vary by distribution and vendor toolchains.
Glossary
- AArch64: A 64-bit ARM architecture used in modern ARM and Apple processors. "On AArch64 architectures such as Apple M4, the {\tt u64}×{\tt u64}={\tt u128} multiplication is split into {\tt umulh}, which returns the upper 64 bits, and {\tt mul}, which returns the lower 64 bits."
- Apple M4: An Apple-designed ARM-based processor used as a 64-bit target in the paper’s evaluations. "and 1.98x on Apple M4 (Apple M-series SoC) in the microbenchmark described later."
- BMI2 (-mbmi2): An x86-64 instruction set extension providing advanced bit-manipulation operations; the -mbmi2 flag enables it in code generation. "For x86-64, the comparison used -O2 -mbmi2."
- Clang: A C/C++/Objective-C compiler frontend based on LLVM. "including GCC, Clang, Microsoft Compiler, and Apple Clang."
- GM method: The Granlund–Montgomery technique for replacing division by invariant integers with multiplication and shifting. "Their method (called the GM method in this paper)"
- latency: The time (in cycles) for an instruction to produce its result. "the 128-bit logical right-shift instruction {\tt shrd} requires the same latency and throughput as {\tt mul} on Skylake-X \cite{anger}."
- llc: LLVM’s static compiler backend that lowers LLVM IR to machine assembly. "Then we used LLVM's llc to generate assembly for each CPU."
- logical right-shift: A bit shift that inserts zeros on the left, used to emulate unsigned division by powers of two. "the 128-bit logical right-shift instruction {\tt shrd}"
- LLVM: A modular compiler infrastructure providing toolchains and backends for multiple architectures. "We implemented the proposed optimization in LLVM (the compiler infrastructure of Clang)"
- LLVM IR: LLVM’s typed, low-level intermediate representation used for machine-independent optimizations. "First, we generated LLVM IR bench.ll"
- microbenchmark: A small, focused benchmark designed to measure the performance of a specific operation. "in the microbenchmark described later."
- Sapphire Rapids: Intel Xeon server microarchitecture used as an evaluation platform. "Intel Xeon w9-3495X (Sapphire Rapids)"
- shrd: An x86 instruction that performs a doubleword (128-bit) right shift across two registers. "the 128-bit logical right-shift instruction {\tt shrd} requires"
- Skylake-X: An Intel high-end desktop/server microarchitecture used for performance characterization. "on Skylake-X \cite{anger}"
- SoC (System on Chip): An integrated circuit that consolidates CPU, memory, and peripherals on a single chip. "(Apple M-series SoC)"
- throughput: The rate at which instructions can be issued/retired (e.g., instructions per cycle). "the 128-bit logical right-shift instruction {\tt shrd} requires the same latency and throughput as {\tt mul} on Skylake-X \cite{anger}."
- u32/u64/u128: Unsigned integer types with 32-, 64-, and 128-bit widths, respectively. "We denote 32/64/128-bit unsigned integer types by {\tt u32/u64/u128}."
- umulh: An AArch64 instruction that returns the upper 64 bits of a 128-bit product of two 64-bit integers. "is split into {\tt umulh}, which returns the upper 64 bits"
- x86-64: The 64-bit extension of the x86 architecture used by modern Intel/AMD CPUs. "For x86-64, the comparison used -O2 -mbmi2."
- zero-extended: An operation that widens an integer to a larger width by filling higher bits with zeros. "where is zero-extended to 64 bits."