
Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

Published 9 Apr 2026 in cs.PL and cs.AR | (2604.07902v1)

Abstract: Granlund and Montgomery proposed an optimization method for unsigned integer division by constants [3]. Their method (called the GM method in this paper) was further improved in part by works such as [1] and [7], and is now adopted by major compilers including GCC, Clang, Microsoft Compiler, and Apple Clang. However, for example, for x/7, the generated code is designed for 32-bit CPUs and therefore does not fully exploit 64-bit capabilities. This paper proposes an optimization method for 32-bit unsigned division by constants targeting 64-bit CPUs. We implemented patches for LLVM/GCC and achieved speedups of 1.67x on Intel Xeon w9-3495X (Sapphire Rapids) and 1.98x on Apple M4 (Apple M-series SoC) in the microbenchmark described later. The LLVM patch has already been merged into llvm:main [6], demonstrating the practical applicability of the proposed method.

Summary

  • The paper introduces a new algorithm for 32-bit unsigned division by constants on 64-bit CPUs, replacing multi-shift sequences with a single multiply and shift.
  • It leverages 64-bit multiply-high instructions to streamline the division process, eliminating inefficient 32-bit intermediates used in legacy compilers.
  • Empirical benchmarks on Intel and Apple CPUs demonstrate speedups of up to 1.98x; the LLVM patch has been merged upstream, and a GCC patch is under review.

Optimizing 32-bit Unsigned Division by Constants on 64-bit Architectures

Introduction

Division by constant integers is a classical optimization target in compilers due to the prohibitive latency of division instructions on modern CPUs. While the Granlund-Montgomery (GM) method is traditionally used for implementing division by constants via multiplication and shift operations, legacy algorithmic forms in current compilers fail to fully leverage the computational capabilities and wider registers available in 64-bit architectures. This work develops and formalizes a new approach specifically targeting 32-bit unsigned division by constants on 64-bit CPUs, yielding significant empirical improvements and resulting in adopted changes in mainstream compiler toolchains.

Background and Motivation

The GM method, and its refinements [GM, CDW, Hacker, Optimal], replace division by invariant constants with faster sequences involving a precomputed "magic" multiplier and right shifts. Compilers usually handle 32-bit unsigned division by computing, at compile time, a multiplier c and a shift a such that division by d is replaced with (x × c) >> a for all x in [0, 2^32 − 1]. For most d, c fits in 32 bits; for about 23% of divisors, it requires 33 bits. Because legacy code generation is optimized for 32-bit targets, even 64-bit code generators restrict intermediate results to 32 bits, leading to inefficient instruction sequences on 64-bit CPUs.

Analysis of Traditional Compiler Approaches

For 32-bit unsigned division with a 33-bit magic constant c (i.e., c in [2^32, 2^33 − 1]), current compiler-generated code (LLVM, GCC, Apple Clang, and MSVC) emits three shift stages after the multiplication to reconstruct the correct quotient using only 32-bit intermediates. This sequence exists to support 32-bit CPUs and, while correct, incurs suboptimal performance on 64-bit targets, where wide registers and multiply instructions directly support more efficient forms.

The x86-64 MUL/IMUL and AArch64 widening multiply-high instructions could make these shift sequences redundant, but compilers do not use them for this division pattern.

Proposed Optimization Methodology

The core observation enabling the improved algorithm is that for the 33-bit magic constant case on a 64-bit CPU, the division can always be implemented with a single multiply operation and a 64-bit right shift, specifically:

(x × c) >> a = (x × (2^(64−a) × c)) >> 64

Here:

  • Only one 64-bit multiplication is required (x is zero-extended to 64 bits).
  • The scaled constant 2^(64−a) × c is guaranteed to fit in 64 bits (since a ≥ 33).
  • No intermediate overflows occur.
  • No 128-bit logical shifts are necessary; the high and low halves of the product are realized by existing hardware instructions.
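Why the identity holds can be written out in one line: multiplying numerator and denominator by the same power of two leaves the exact rational value, and hence the floor, unchanged:

```latex
\left\lfloor \frac{x \cdot c}{2^{a}} \right\rfloor
  = \left\lfloor \frac{x \cdot c \cdot 2^{64-a}}{2^{64}} \right\rfloor
  = \left\lfloor \frac{x \cdot t}{2^{64}} \right\rfloor,
\qquad t = 2^{64-a} \cdot c .
```

Since t fits in 64 bits, the right-hand side is exactly what a 64-bit multiply-high instruction computes on the zero-extended x.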

On AArch64, this transformation allows replacement by a single UMULH instruction, which is highly efficient. On x86-64, the method is implemented with one multiply and a straightforward high-half extraction, removing redundant shifting and masking.

Experimental Results

Patches using this methodology were merged into LLVM and submitted to GCC. Empirical benchmarks using the divisors 7, 19, and 107 (each necessitating a 33-bit magic constant) on Intel Xeon w9-3495X (Sapphire Rapids) and Apple M4 SoCs demonstrated compelling speedups:

CPU            Legacy Latency (s)   Optimized Latency (s)   Speedup
Xeon w9-3495X  6.33                 3.80                    1.67x
Apple M4       6.70                 3.38                    1.98x

These measurements used default system configurations with repeated sampling, showing both strong mean improvements and low variance.

Implementation and Upstream Impact

The LLVM patch for this optimization has been accepted upstream [LLVMopt]. A GCC patch is under review [GCCpatch]. Replacing the three-shift GM method with the multiply-high approach on 64-bit targets streamlines code generation and enhances performance for all constant divisors requiring a 33-bit multiplier.

Practical and Theoretical Implications

This result demonstrates that constant-division lowering, a staple of compiler code generation, must adapt to evolving hardware capabilities rather than adhering to legacy restrictions. The switch to multiply-high for 33-bit constants on 64-bit CPUs reduces both dynamic instruction count and dependency depth, directly impacting inner-loop performance, especially in numerics- and cryptography-heavy workloads. Furthermore, the technique generalizes to wider integers in future architectures and can inform domain-specific compiler transformations and JITs.

Theoretically, it formalizes a completeness result: every 32-bit division by constant can be implemented, without conditionals, using at most one multiplication and one shift, given adequate register width.

Conclusion

The optimized algorithm for 32-bit unsigned division by constants on 64-bit architectures obviates the need for multi-shift GM method implementations when 33-bit multipliers are required. The empirical speedups and rapid adoption in mainstream toolchains signal both practical and methodological advancement. This work highlights the ongoing necessity to co-optimize arithmetic transformations with the architectural capabilities of the hardware, and points toward further lowering improvements as register widths and multiply instructions continue to scale.

Explain it Like I'm 14

What is this paper about?

This paper shows a faster way for computers to divide 32‑bit numbers by a fixed number (like 7 or 19) when running on modern 64‑bit chips. Instead of using the slow “divide” instruction, the authors find a smarter sequence that mostly uses multiplication, which computers do much faster. They put this new method into real compilers so lots of programs can get quicker without changing their source code.

What problem are the authors trying to solve?

Dividing numbers on a CPU is slower than multiplying them. Compilers (the tools that turn your code into machine instructions) already use a classic trick: when you divide by a fixed number, they replace it with “multiply by a special constant” and then “shift right” to get the same answer. This trick is often called the GM method and uses a “magic number.”

However, current compilers still use a design that fits older 32‑bit CPUs for a tricky subset of divisors. On 64‑bit CPUs, that older design does extra steps it no longer needs. The authors ask: can we make constant division faster on 64‑bit chips by fully using 64‑bit features?

How do compilers usually speed up division by a fixed number?

Here’s the everyday idea:

  • Imagine you want to compute x divided by 7.
  • Instead of doing slow long division every time, you precompute a “magic number” that acts like the reciprocal of 7 in integer math.
  • Then you do: multiply x by that magic number, and shift the result to the right by some amount. This gives the same quotient as x/7 for all 32‑bit x.

In real compilers, this comes from math that guarantees the multiply‑then‑shift gives exactly the same result as dividing by that constant.

What’s the catch? For most divisors, the magic number fits in 32 bits, and the code is simple. For some divisors (like 7, 19, 107), the best magic number needs 33 bits. Older code stayed within 32‑bit limits by breaking the work into several steps and shifts, which adds extra instructions.

A few words you’ll see:

  • 32‑bit number: can store 0 up to about 4 billion.
  • 64‑bit number: can store much larger values (up to about 18 quintillion).
  • Unsigned: only non‑negative numbers.
  • Shift right: like dividing by a power of 2 very quickly.
  • “Upper 64 bits” of a product: when you multiply two 64‑bit numbers, the exact result can be up to 128 bits long. The “upper half” is like the leftmost part of that long result.

What new approach do the authors propose?

They noticed that on a 64‑bit CPU you don’t need to force every intermediate value to fit in 32 bits. For the “33‑bit magic number” cases, they rearrange the math so that:

  • Instead of forming a full 128‑bit result and doing a slow 128‑bit shift, they choose a slightly adjusted constant that fits in 64 bits.
  • Then they do one 64‑bit multiply and simply take the upper 64 bits of the result. On many CPUs, getting the top half is a single, fast instruction.

Why this helps:

  • On x86‑64 (Intel/AMD), the instruction that shifts a 128‑bit value is relatively slow, about as costly as a multiply.
  • Their method avoids that slow 128‑bit shift. It keeps only one multiply and reads the high half of the product.
  • On Apple’s ARM chips (like the M4), there’s a special instruction called “umulh” that directly gives the upper 64 bits of the multiplication, so you can do the whole operation with just that one instruction.

In short, for the hard cases where the old code used a multi‑step sequence, the new code uses a single multiply‑style instruction.

How did they test it?

They:

  • Modified LLVM (the system underneath the Clang compiler) to use their new rule.
  • Built simple benchmark programs that divide many 32‑bit numbers by fixed divisors where the old method struggled (like 7, 19, and 107).
  • Generated machine code for two platforms: an Intel Xeon (x86‑64) and an Apple M4 (ARM64).
  • Timed how long the programs ran before and after the change.

What did they find, and why does it matter?

Main results:

  • On an Intel Xeon w9‑3495X (Sapphire Rapids), code using the new method ran about 1.67× faster.
  • On an Apple M4, it ran about 1.98× faster.
  • The loops in the generated code became smaller and simpler (fewer instructions).

Why it matters:

  • Division by constants happens all over the place: in graphics, data processing, cryptography, and more.
  • Faster building blocks in compilers make many programs faster automatically, without developers changing their code.
  • Less work per division can also save energy and improve responsiveness.

The practical impact is clear: their LLVM change was accepted into the main project, and they also prepared a GCC patch that’s under review. That means these speedups can reach many languages and applications.

The simple takeaway

When dividing 32‑bit numbers by fixed values on 64‑bit CPUs, the authors found a way to replace a multi‑step “shift-heavy” sequence with a single multiply that uses the CPU’s 64‑bit strengths. This small change in how compilers generate code makes certain divisions nearly twice as fast on real machines—and that improvement can benefit a huge range of software automatically.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, phrased to guide follow-on research and engineering work:

  • Scope limitation to 32-bit unsigned division: no treatment of signed division, 8/16/64-bit dividends, or wider cases (e.g., 64-bit dividends on 64-bit targets where “magic” may be 65+ bits and would require 128-bit multiplications).
  • Unexplored extension to the 32-bit-magic case: assess whether the same high-multiply trick with a scaled constant (t = c · 2^(64−a)) also outperforms the classic “mul+shift” for divisors where c fits in 32 bits, on both x86-64 and AArch64.
  • Formal correctness proof gap: the identity q = umulh(x, t) with t = c · 2^(64−a) is sketched but not proven rigorously for all parameters; provide a complete proof that for all relevant d (with 33-bit c) and x ∈ [0, 2^32 − 1], floor(x·c / 2^a) equals floor(x·t / 2^64).
  • Boundary conditions on a: the paper asserts 33 ≤ a ≤ 63 given d ≤ M/2; provide a formal derivation and verify there are no rare d for which a = 64 would be required (or specify handling if it occurs).
  • Completeness across all divisors with 33-bit c: performance and codegen were shown for d ∈ {7, 19, 107}; quantify the speedup distribution and any regressions across the full set of divisors that yield 33-bit c.
  • Architecture coverage is narrow: evaluate on additional microarchitectures (e.g., Intel Ice/Tiger/Raptor Lake, Atom/Tremont/Gracemont, AMD Zen 2/3/4/5, Neoverse cores, older A57/A72, Apple M1/M2/M3, POWER, and RISC-V RV64 with mulh) to validate generality and identify exceptions.
  • Cost-model and heuristic integration: define target-aware selection rules (latency/throughput, port pressure, register constraints, code size) to decide when to prefer the new transform versus existing GM sequences per architecture/CPU model and optimization level.
  • x86 instruction choice and flags/ports: investigate using BMI2 MULX (to avoid RAX/RDX constraints and flags clobbering) versus MUL/IMUL, and quantify impacts on dependency chains, register pressure, and port utilization.
  • Immediate encoding and code-size trade-offs: analyze when the 64-bit constant t cannot be encoded as an immediate (x86 needs a mov imm64; AArch64 needs MOVZ/MOVK sequences); measure code-size and performance impacts and propose hoisting/constant-pool strategies and size-aware heuristics (-Os/-Oz).
  • Negative cases/regressions: identify workloads where multiply-high is slower than shifts (e.g., on cores with cheap double-shifts or expensive multiplies, high multiply-port pressure, or dependency chains sensitive to mul latency) and gate the transform accordingly.
  • Throughput vs latency characterization: provide separate benchmarks for independent-iteration throughput and tight dependency chains to understand the latency sensitivity of the multiply-high approach vs shift sequences.
  • Vector/SIMD generalization: study SIMD forms (e.g., x86 AVX2/AVX-512 pmuludq + shifts, ARM NEON/SVE umulh equivalents) for accelerating simultaneous divisions over vectors of u32, including lane-wise correctness and performance.
  • Combined quotient and remainder: explore whether computing r = x − q·d (after q via multiply-high) is competitive vs alternative remainder methods (e.g., Lemire’s fast remainder) and when both q and r are needed.
  • Interaction with hardware udiv: quantify performance relative to native hardware division on AArch64 and x86-64 across CPUs; clarify when replacing udiv with multiply-high remains beneficial, especially for single-use (non-loop) occurrences.
  • Compiler pipeline integration details: ensure the transform is applied consistently across SelectionDAG, GlobalISel, and MIR; analyze interactions with vectorizers (SLP/Loop), LTO, and late peepholes that might either defeat or further optimize the pattern.
  • PIC/relocation and constant materialization: evaluate impacts under position-independent code and different relocation models (constant in rodata vs literal pools), including linker relaxation opportunities or penalties.
  • Energy and power: measure power/energy effects of replacing shifts with multiplies across architectures, especially in mobile SoCs where energy efficiency matters.
  • Robust benchmarking methodology: supplement “time” measurements with hardware performance counters (cycles, uops, IPC, port usage, cache misses) and report statistical rigor (confidence intervals) and code-size deltas.
  • Real-world workloads: validate end-to-end impact in representative applications (e.g., cryptographic codebases cited, compression, parsers) rather than microbenchmarks alone, and identify domains where the transform offers material wins.
  • Toolchain parity and maturity: present results for the GCC patch (not just LLVM), and ensure test suites (e.g., LLVM/GCC regression and performance suites) cover this transform to prevent future regressions.
  • Safety under unusual ABIs or sanitizers: verify behavior with sanitizers (UBSan/ASan), unusual calling conventions, and mixed-language bindings where register constraints of MUL/RDX:RAX could interact with ABI-specific rules.
  • Distributional insight into 33-bit c cases: beyond reporting the 77%/23% split, characterize which d ranges produce 33-bit c, how a varies with d, and whether performance benefits correlate with a or t magnitude.
  • Extension to other constant-division optimizations: explore synergy with reciprocal-multiply techniques for floating-point-like reciprocals, multi-precision divisions, or hybrid schemes that combine table-driven approximations with multiply-high for edge cases.

Practical Applications

Immediate Applications

The following applications can be deployed now, given that the optimization is already merged into LLVM and practical on modern x86-64 and AArch64 CPUs. Each item notes sectors, potential tools/workflows, and key assumptions/dependencies.

  • Compiler toolchains (C/C++/Rust/Swift/Julia/Zig) speed-ups “for free”
    • Sectors: Software, HPC, embedded/mobile, cloud
    • What: Any code compiled with LLVM-based compilers (Clang, Rustc, Swiftc, Julia’s JIT, Zig, many MLIR/LLVM-based DSLs) benefits when performing 32-bit unsigned division by a compile-time constant that yields a 33-bit “magic” (c) in GM method.
    • Tools/workflows: Upgrade to LLVM versions that include the patch; enable standard -O2/-O3 optimizations; CI/CD toolchain refresh
    • Assumptions/dependencies: 64-bit target; divisor is a compile-time constant; GM-derived c is 33 bits (about 23% of divisors < 2^31); the optimization pass is enabled in the vendor’s LLVM-based compiler release
  • Faster constant-division in performance-critical libraries
    • Sectors: Systems software, compression, parsing/formatting, signal processing, game engines
    • What: Tight loops that divide by constants (e.g., scaling/quantization factors, table indexing, format conversions such as parsing/printing by 10, 100, etc.) see reduced instruction count and lower latency; on AArch64 the pattern maps to a single umulh
    • Tools/workflows: Rebuild with an updated LLVM-based toolchain; performance audits for hot loops; prefer compile-time constants (e.g., constexpr) where possible
    • Assumptions/dependencies: The divisor is known at compile time and falls into the 33-bit c case; some libraries already hand-optimize—ensure code doesn’t inhibit compiler pattern recognition
  • JIT-compiled analytics and databases
    • Sectors: Data analytics, databases, stream processing
    • What: LLVM-based query engines (e.g., DuckDB extensions, Velox-based systems, custom LLVM JITs) that lower expressions like bucket = value / width where width is constant at plan time
    • Tools/workflows: Adopt patched LLVM in the JIT; ensure constant folding/promotion exposes constants before codegen
    • Assumptions/dependencies: The divisor is a constant at JIT time; JIT uses LLVM back-end; target is 64-bit
  • Cryptography and post-quantum cryptography implementations
    • Sectors: Security, finance, embedded/mobile
    • What: Constant-division that remains in pack/unpack/encoding steps, digit/base conversions, or certain parameter rescalings can speed up; the authors reference ML-KEM/ML-DSA ecosystems and provide related repos
    • Tools/workflows: Rebuild libs with LLVM trunk or the first release carrying the patch; where needed, drop-in a header-only “constdiv” routine that emits mulhi/umulh directly for portability
    • Assumptions/dependencies: Many crypto reductions avoid division (Barrett/Montgomery) so benefits depend on code structure; divisors must be compile-time constants in hot paths
  • LLVM-based numerical and ML compilers
    • Sectors: AI/ML, scientific computing
    • What: Code generators (TVM, XLA backends using LLVM, Halide, MLIR-based pipelines) often materialize constant scaling (e.g., quantized kernels); constant divisions can be lowered to mulhi
    • Tools/workflows: Update LLVM in the compiler stack; ensure constant-propagation occurs before lowering; add IR patterns that preserve constant-division forms
    • Assumptions/dependencies: Scales are constants at compile/JIT time; the target CPU supports fast 64×64→128 mul with cheap access to the high half (x86-64/AArch64)
  • Energy and battery efficiency improvements
    • Sectors: Mobile/edge devices, data centers
    • What: Fewer instructions and lower runtime for constant-division-heavy loops reduce energy per operation
    • Tools/workflows: Recompile with updated LLVM; include this optimization in energy/perf regression dashboards; A/B test power under representative workloads
    • Assumptions/dependencies: Workload contains nontrivial volumes of 32-bit constant divisions in the 33-bit c class; DVFS and scheduling effects may modulate realized savings
  • Static optimization and binary tooling
    • Sectors: Tooling, performance engineering
    • What: LLVM opt/llc, BOLT, or custom MachineFunction passes can canonicalize older GM sequences into the new single-multiply form on 64-bit targets
    • Tools/workflows: Integrate a post-link binary optimizer pass; pattern-match the 3-shift GM sequence and replace with mulhi form
    • Assumptions/dependencies: Reliability of pattern detection; target CPU and ABI constraints; full correctness across corner cases
  • Developer guidance and code review checklists
    • Sectors: Software engineering, education
    • What: Update guidelines to (a) keep constant divisors as compile-time constants, (b) avoid manual sequences that block optimization, and (c) rely on the compiler to emit the optimal mulhi-based code
    • Tools/workflows: Linters or Clang-Tidy checks to detect manual patterns; CI enforcing modern compilers
    • Assumptions/dependencies: Teams can upgrade toolchains; codebases allow minor refactors to expose constants
  • Interim portability via micro-libraries
    • Sectors: Embedded, cross-platform systems
    • What: Use a small, header-only helper that precomputes 2^(64−a) * c and applies mulhi to compute x / d for u32/constant d, ensuring consistent performance across compilers
    • Tools/workflows: Integrate “constdiv” (as cited by the authors) while waiting for GCC adoption or for vendor LLVM releases to roll out
    • Assumptions/dependencies: Careful testing for all edge cases; ensure ABI constraints (e.g., inline asm vs. portable intrinsics) are respected

Long-Term Applications

These require additional research, engineering, or ecosystem adoption before broad deployment.

  • Generalization to 64-bit dividends by constants on 64-bit targets
    • Sectors: Software, HPC, databases, cryptography
    • What: Extend the “scale-then-mulhi” idea to u64/constant divisions (where “magic” may be 65 bits) to avoid 128-bit shifts in quotient computation
    • Dependencies: New proofs/derivations for bounds; careful codegen on architectures with different mul/shift trade-offs; compiler integration and benchmarking across microarchitectures
  • Vectorized/SIMD constant division
    • Sectors: AI/ML, graphics, signal processing, databases
    • What: Introduce lane-wise mulhi strategies in NEON/SVE and AVX2/AVX-512 for vectorized u32/constant divisions; benefit columnar scans and image/audio kernels
    • Dependencies: Instruction availability (e.g., per-lane high-half multiplies); vectorizer support in LLVM; ensuring the transform’s profitability model in the presence of lane permutations
  • Unified quotient+remainder optimization
    • Sectors: Compilers, systems libraries
    • What: Combine this quotient method with state-of-the-art remainder algorithms (e.g., “Faster Remainder by Direct Computation”) to pick optimal sequences for / and % by constants
    • Dependencies: Compiler heuristics that compose transformations, reduce register pressure, and respect microarchitectural latencies
  • PGO-driven specialization for constant divisions
    • Sectors: Software at scale, cloud services
    • What: Use Profile-Guided Optimization to identify hot constant-division sites, ensure the constant form is preserved, and auto-select the best codegen variant per CPU family
    • Dependencies: PGO infrastructure; IR stability to retain constants; CPU-specific cost models
  • Domain-specific JIT adoption (quantization-heavy inference/graphics pipelines)
    • Sectors: AI inference, AR/VR, robotics
    • What: Embed codegen rules in DSLs (Halide/TVM/MLIR) to prefer mulhi lowering for compile-time scales (e.g., uniform quantization steps or fixed bin widths)
    • Dependencies: Ensuring those scales are known at JIT-time; cross-target testing; maintaining numerical equivalence (rounding vs. truncation behavior)
  • Hardware/ISA feedback loop
    • Sectors: Semiconductor, CPU design
    • What: Encourage high-throughput, low-latency access to the high half of multiplies (e.g., streamlined mulhi without RAX/RDX dependencies on x86) to further improve this and similar idioms
    • Dependencies: ISA evolution cycles; microarchitectural design trade-offs; compiler support to detect and exploit new instructions
  • Formal verification and correctness tooling
    • Sectors: Safety-critical systems, academia
    • What: Mechanize proofs for the scaled-magic approach, build compiler verification tests, and integrate into formal frameworks for arithmetic code transformations
    • Dependencies: Proof frameworks (Coq/Isabelle), conformance suites, and cooperation with compiler vendors
  • Policy and sustainability initiatives
    • Sectors: Public sector IT, energy
    • What: Incorporate modern compiler requirements (with optimizations like this) into procurement to reduce compute energy budgets at scale
    • Dependencies: Coordination with OS vendors and distributions; certification baselines that include compiler versioning and benchmarks

Notes on feasibility and scope

  • Trigger conditions: The optimization applies to 32-bit unsigned division by a compile-time constant where the GM “magic” constant c is 33 bits; other cases (power-of-two divisors, very large divisors, or 32-bit c) already have efficient code paths.
  • CPU prerequisites: 64-bit targets with fast 64×64→128 multiplication and cheap access to the high 64 bits (e.g., x86-64 mul to RDX:RAX; AArch64 umulh).
  • Expected impact: Microbenchmarks show 1.67× (Sapphire Rapids) to 1.98× (Apple M4) speedups for affected divisions; end-to-end gains depend on how frequently such divisions occur in hot paths.
  • Rollout status: LLVM change is merged to llvm:main; GCC patch is under review, so timelines vary by distribution and vendor toolchains.

Glossary

  • AArch64: A 64-bit ARM architecture used in modern ARM and Apple processors. "On AArch64 architectures such as Apple M4, the u64×u64=u128 multiplication is split into umulh, which returns the upper 64 bits, and mul, which returns the lower 64 bits."
  • Apple M4: An Apple-designed ARM-based processor used as a 64-bit target in the paper’s evaluations. "and 1.98x on Apple M4 (Apple M-series SoC) in the microbenchmark described later."
  • BMI2 (-mbmi2): An x86-64 instruction set extension providing advanced bit-manipulation operations; the -mbmi2 flag enables it in code generation. "For x86-64, the comparison used -O2 -mbmi2."
  • Clang: A C/C++/Objective-C compiler frontend based on LLVM. "including GCC, Clang, Microsoft Compiler, and Apple Clang."
  • GM method: The Granlund–Montgomery technique for replacing division by invariant integers with multiplication and shifting. "Their method (called the GM method in this paper)"
  • latency: The time (in cycles) for an instruction to produce its result. "the 128-bit logical right-shift instruction shrd requires the same latency and throughput as mul on Skylake-X [anger]."
  • llc: LLVM’s static compiler backend that lowers LLVM IR to machine assembly. "Then we used LLVM's llc to generate assembly for each CPU."
  • logical right-shift: A bit shift that inserts zeros on the left, used to emulate unsigned division by powers of two. "the 128-bit logical right-shift instruction shrd"
  • LLVM: A modular compiler infrastructure providing toolchains and backends for multiple architectures. "We implemented the proposed optimization in LLVM (the compiler infrastructure of Clang)"
  • LLVM IR: LLVM’s typed, low-level intermediate representation used for machine-independent optimizations. "First, we generated LLVM IR bench.ll"
  • microbenchmark: A small, focused benchmark designed to measure the performance of a specific operation. "in the microbenchmark described later."
  • Sapphire Rapids: Intel Xeon server microarchitecture used as an evaluation platform. "Intel Xeon w9-3495X (Sapphire Rapids)"
  • shrd: An x86 instruction that performs a doubleword (128-bit) right shift across two registers. "the 128-bit logical right-shift instruction shrd requires"
  • Skylake-X: An Intel high-end desktop/server microarchitecture used for performance characterization. "on Skylake-X [anger]"
  • SoC (System on Chip): An integrated circuit that consolidates CPU, memory, and peripherals on a single chip. "(Apple M-series SoC)"
  • throughput: The rate at which instructions can be issued/retired (e.g., instructions per cycle). "the 128-bit logical right-shift instruction shrd requires the same latency and throughput as mul on Skylake-X [anger]."
  • u32/u64/u128: Unsigned integer types with 32-, 64-, and 128-bit widths, respectively. "We denote 32/64/128-bit unsigned integer types by u32/u64/u128."
  • umulh: An AArch64 instruction that returns the upper 64 bits of a 128-bit product of two 64-bit integers. "is split into umulh, which returns the upper 64 bits"
  • x86-64: The 64-bit extension of the x86 architecture used by modern Intel/AMD CPUs. "For x86-64, the comparison used -O2 -mbmi2."
  • zero-extended: An operation that widens an integer to a larger width by filling higher bits with zeros. "where x is zero-extended to 64 bits."
