Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search

Published 5 Apr 2026 in cs.PF | (2604.04311v1)

Abstract: An N-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the context-free model, nodes represent computation stages and edge weights are independently measured instruction costs. In the context-aware model, nodes are expanded to encode the predecessor edge type, so that edge weights capture inter-operation correlations such as cache warming: the cost of operation B depends on which operation A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW (Frigo & Johnson, 1998): that optimal-substructure assumptions break down "because of the different states of the cache." Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1 GFLOPS (74% of optimal). The context-aware Dijkstra discovers R4 → R2 → R4 → R4 → Fused-8 at 29.8 GFLOPS, a 5.2× improvement over pure radix-2 and 34% faster than the context-free result. This arrangement includes a radix-2 pass sandwiched between radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.

Summary

  • The paper introduces a context-aware graph search approach that models first-order dependencies to optimize SIMD instruction scheduling for FFTs.
  • It demonstrates that a context-aware DAG yields a 34% performance improvement over context-free models and a 5.2× speedup over pure radix-2 scheduling.
  • The framework is hardware-sensitive, enabling architecture-specific scheduling optimizations that can extend to other multi-stage computational pipelines.

Problem Formulation and Motivation

The Fast Fourier Transform (FFT) is a central computational primitive whose efficient implementation on modern hardware demands close attention to low-level instruction scheduling, cache locality, and register resources. Although all valid Cooley-Tukey decompositions produce identical results, their instruction-level implementations differ drastically in achievable throughput because of variations in stage radix (radix-2, radix-4, radix-8), register blocking, and cache behavior.

Historically, tools like FFTW and SPIRAL have approached this challenge with heuristic or empirical search strategies, often relying on dynamic programming under the optimal-substructure assumption: that the fastest way to compute a partial result is independent of its computational context. In practice this assumption breaks down, because real-world effects such as cache state transitions and register pressure create first-order dependencies between scheduling decisions.

This work addresses the limitations of the optimal-substructure assumption by recasting the problem as a shortest-path search on a specifically constructed directed acyclic graph (DAG). Critically, the graph is augmented to encode context, so that edge weights capture not only the cost of an operation but also how that cost depends on the preceding operation, primarily through cache and register state.

Context-Free vs. Context-Aware Graph Models

The baseline, context-free DAG formulation represents each computation stage as a node and each valid FFT scheduling alternative (e.g., a radix-2 pass, a fused register block) as an edge, with empirically measured cost as the edge weight. Solution paths from initial to final nodes represent specific FFT decomposition choices; the shortest path corresponds to the fastest scheduling.
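To make this concrete, the following minimal Python sketch builds the context-free graph and runs Dijkstra over it. The edge types and stage counts follow the paper (single radix-2/4/8 passes plus Fused-8/16/32 blocks advancing 3-5 stages), but the cost table is invented purely for illustration, since real edge weights are measured on the target CPU, and the sketch ignores any restriction of fused blocks to the final stages.

```python
import heapq

N = 1024
L = N.bit_length() - 1  # log2(N) = 10 butterfly stages for N = 1024

# Edge types and how many of the L stages each one advances.
EDGE_TYPES = [("R2", 1), ("R4", 2), ("R8", 3),
              ("F8", 3), ("F16", 4), ("F32", 5)]

def shortest_plan(cost):
    """Dijkstra from stage 0 to stage L; cost(op, stage) is a measured time."""
    dist, prev = {0: 0.0}, {}
    heap = [(0.0, 0)]
    while heap:
        d, s = heapq.heappop(heap)
        if s == L:
            break
        if d > dist[s]:
            continue  # stale heap entry
        for op, adv in EDGE_TYPES:
            if s + adv > L:
                continue
            nd = d + cost(op, s)
            if nd < dist.get(s + adv, float("inf")):
                dist[s + adv], prev[s + adv] = nd, (s, op)
                heapq.heappush(heap, (nd, s + adv))
    # Walk predecessors back from stage L to recover the pass sequence.
    plan, s = [], L
    while s:
        s, op = prev[s]
        plan.append(op)
    return dist[L], plan[::-1]

# Invented costs purely for demonstration; real weights come from benchmarks.
fake_cost = {"R2": 3.0, "R4": 2.4, "R8": 2.9, "F8": 2.0, "F16": 1.9, "F32": 2.7}
total, plan = shortest_plan(lambda op, stage: fake_cost[op])
print(plan, total)
```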

The context-aware variant expands the node-space to explicitly encode the predecessor edge (operation type), so that each node represents not just the current computation stage but the kind of operation that preceded it. This allows edge weights to be conditioned on the operation history, thus capturing inter-operation effects such as cache warming and register spill/fill costs. Measurement protocols are adapted: for each potential transition, the previous operation is executed (untimed) before the candidate operation is timed, providing a realistic cost measure that reflects memory and cache residencies.
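A sketch of that measurement protocol follows, with a hypothetical run_pass(op, stage) callable standing in for the real NEON kernels; the 50-trial median mirrors the methodology reported later in this summary, and the stage bookkeeping for the predecessor is simplified.

```python
import time
import statistics

def measure_conditional_cost(prev_op, op, stage, run_pass, trials=50):
    """Time `op` at `stage`, conditioned on `prev_op` having just run.

    `run_pass(op, stage)` is assumed to execute a real kernel on a live
    working buffer, so it leaves genuine cache/register state behind.
    """
    samples = []
    for _ in range(trials):
        if prev_op is not None:
            run_pass(prev_op, stage)          # untimed: establishes context
        t0 = time.perf_counter_ns()
        run_pass(op, stage)                   # timed: the candidate edge
        samples.append(time.perf_counter_ns() - t0)
    return statistics.median(samples)         # median resists OS-noise outliers
```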

The context-aware model expands the state space by a multiplicative factor equal to the number of edge types, yet remains computationally tractable (e.g., 77 nodes for N = 1024) and, crucially, models performance-affecting correlations that context-free abstractions miss.
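The expansion itself is mechanical. In the sketch below (names illustrative), each node pairs the number of completed stages with the type of the operation that produced it; with 11 stage values for N = 1024 and six edge types plus a START context, this reproduces the 77-node count quoted above.

```python
from itertools import product

L = 10                                    # log2(1024) butterfly stages
OPS = {"R2": 1, "R4": 2, "R8": 3, "F8": 3, "F16": 4, "F32": 5}
CONTEXTS = ["START"] + list(OPS)          # predecessor edge type, or START

# Context-aware node = (stages completed, operation that got us here).
nodes = [(s, c) for s, c in product(range(L + 1), CONTEXTS)]
assert len(nodes) == 77                   # matches the count for N = 1024

def edges_from(node):
    s, prev = node
    for op, adv in OPS.items():
        if s + adv <= L:
            # The weight would be looked up as w[(prev, op, s)], i.e.,
            # conditioned on what ran immediately before this edge.
            yield (s + adv, op)
```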

Implementation Details on Apple M1 NEON

The graph-based framework is evaluated on the Apple M1 NEON SIMD engine. The implementation comprises highly optimized butterfly cores in split-complex storage and several fused register-blocked FFT kernels, including a new FFT-32 block suited to NEON's 32-register file.

Empirical measurement demonstrates that maximizing block size is not universally optimal: register pressure penalizes large fused blocks, and FFT-32 underperforms FFT-16 and FFT-8 because too few registers remain for twiddle storage. The search framework detects such tradeoffs automatically because every planning alternative is measured directly in realistic context.
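A rough register-budget calculation, assuming split-complex float32 data held in 128-bit NEON registers, illustrates the FFT-32 wall; the 16-registers-for-data figure appears in the paper, while the twiddle accounting here is a hedged back-of-envelope reading.

```python
REGS = 32            # NEON vector registers on ARMv8
LANES = 4            # float32 lanes per 128-bit register

points = 32          # FFT-32 fused block
data_floats = points * 2              # separate Re and Im arrays (split-complex)
data_regs = data_floats // LANES      # 64 / 4 = 16 registers just for data
free_regs = REGS - data_regs          # 16 left for twiddles and temporaries

# log2(32) = 5 in-register passes each need twiddle factors; per the paper,
# once twiddles plus temporaries overflow the free registers, spills to
# memory negate the memory traffic the fusion was meant to save.
print(data_regs, free_regs)
```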

Experimental Results

The context-aware graph search demonstrates strong quantitative improvements:

  • Context-aware search achieves 29.8 GFLOPS on M1 NEON for the N = 1024 split-complex FFT (see the sketch after this list for how such a figure converts to per-transform time).
  • This is a 5.2× speedup over pure radix-2 scheduling and 34% faster than the best context-free search result.
  • The identified optimal sequence is R4 → R2 → R4 → R4 → Fused-8, a pattern that would not have been proposed by any context-free or maximize-radix search, due to the non-obvious inclusion of a radix-2 pass whose cost is beneficial only because of the specific cache state left by the prior R4.
  • Classical “maximize the radix at every step” strategies fail: pure radix-8 variants achieve only 25% of optimal throughput.
  • The framework is architecture-sensitive: the optimal schedule for AVX2 (Intel Haswell) diverges drastically from M1 NEON, confirming that hardware-specific costs must be empirically measured and integrated into decomposition search.
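As a point of reference, a throughput figure like the 29.8 GFLOPS above converts to per-transform time as follows, assuming the common 5·N·log2(N) flop-count convention for complex FFTs (the summary does not state which convention the paper uses).

```python
import math

def fft_gflops(n, seconds):
    """FFT throughput under the 5*N*log2(N) flop convention (an assumption)."""
    return 5 * n * math.log2(n) / seconds / 1e9

n = 1024
flops = 5 * n * math.log2(n)          # 51,200 flops per transform
secs = flops / 29.8e9                 # ~1.72 microseconds at 29.8 GFLOPS
print(f"{secs * 1e6:.2f} us/transform -> {fft_gflops(n, secs):.1f} GFLOPS")
```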

The computational burden of search and measurement is negligible, even compared to established libraries’ planning overheads (e.g., FFTW’s codelet evaluation).

Theoretical and Practical Implications

By explicitly modeling first-order Markov dependencies in stage scheduling, the framework formalizes, quantifies, and operationalizes an effect recognized but previously sidestepped in FFT research (especially in FFTW and SPIRAL). The empirical performance difference underscores the inadequacy of context-free decompositions and highlights the methodological advantage of incorporating context into instruction scheduling.

This approach generalizes beyond FFT to any staged signal processing or scientific computing pipeline where low-level operations have non-independent performance characteristics and where prior state (cache, register, pipeline) is a nontrivial factor. For instance, matrix factorizations, multi-stage filters, and neural network layer fusion can benefit from similar context-aware shortest-path search formulations.

The model also enables straightforward porting to new hardware: edge weights are re-measured, and the shortest-path search is rerun, yielding a new schedule matched to the actual microarchitectural details of the new target.
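As a workflow outline (not the paper's actual API), porting then reduces to re-measuring and re-planning, reusing the kinds of components sketched earlier:

```python
def retarget(ops, stages, measure, search):
    """Re-measure conditional edge weights on a new CPU, then re-plan.

    `measure(prev, op, stage)` benchmarks one context-conditioned edge on
    the target machine; `search(weights)` runs the context-aware shortest
    path. Both are placeholders for the components sketched above.
    """
    weights = {}
    for prev in ["START"] + list(ops):
        for op in ops:
            for s in range(stages + 1):
                weights[(prev, op, s)] = measure(prev, op, s)
    return search(weights)  # new plan tuned to the new microarchitecture
```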

Conclusion

Context-aware shortest-path graph search offers a principled, formal solution for optimal SIMD instruction scheduling of FFTs, capturing cache and register correlations neglected by previous approaches. The demonstrated 34% gain over context-free models shows that modeling this context is necessary on modern CPUs. Architecture-specific, non-obvious decompositions can be discovered automatically, obviating the need for hand-crafted heuristics or oversimplified dynamic programming. The method is broadly applicable wherever stage-local costs are context-sensitive.

Source code is available at: https://github.com/aminems/fft.

Reference: "Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search" (2604.04311).

Explain it Like I'm 14

Shortest-Path FFT: What This Paper Is About (In Simple Terms)

Big Picture

This paper is about making a common math tool, the Fast Fourier Transform (FFT), run as fast as possible on modern computer chips. An FFT is used in lots of tech—like music apps, image processing, and science—to turn signals from “time” into “frequency.” The trick is: there are many different ways to program an FFT that all give the same answer, but some ways are much faster than others depending on the computer.

The authors show a new, simple idea: pick the fastest FFT the same way a GPS finds the quickest route—by turning choices into a map and running a shortest-path search.


1) Overview: The Main Topic

An FFT can be built from different “building blocks” and in different orders. Each choice uses different computer instructions and memory patterns. The paper turns all these choices into a graph (like a map), measures how long each choice takes, and then uses a shortest-path algorithm (Dijkstra’s algorithm—like a GPS for graphs) to find the fastest overall plan. They also show that to truly get the best result, you must consider what just happened before each step because that changes how fast the next step runs.


2) What Questions Are They Trying to Answer?

  • How can we automatically choose the fastest way to run an FFT on a specific chip?
  • If we think of FFT choices as a “map,” can a shortest-path search find the fastest plan?
  • Do we need to consider “context”—what the previous step did—to predict the next step’s speed?
  • Can this method beat common rules of thumb, like “always use bigger steps” or “use the same best building block everywhere”?

3) How They Did It (With Easy Analogies)

Think of cooking a recipe:

  • You need to do several stages. Each stage can be done in a few different ways (like chopping with a knife or using a food processor).
  • Even if two methods produce the same result, one might be faster depending on what you just did (e.g., the cutting board is already out and clean).

Now translate that to computers:

  • The FFT has “stages” you must do in order.
  • Each stage can be done with different “radixes” (radix-2, radix-4, radix-8) or by “fusing” several stages so the data stays in super-fast storage.
  • “Registers” are like the chef’s hands—very fast, but you can’t hold much at once.
  • The “cache” is like a nearby shelf; it’s faster than the pantry (main memory). If you just used something, it might still be warm and ready (a “warm” cache), making the next step faster. That’s called a cache effect.

Their approach:

  • Build a graph where each node is “how many FFT stages are done” and each edge is “a way to do the next 1–5 stages” (like choosing knife vs. food processor).
  • Measure how long each edge takes on the real computer (don’t guess).
  • Run Dijkstra’s algorithm (a shortest-path search) to find the fastest sequence from start to finish.

Context-free vs. context-aware:

  • Context-free: assume each step takes the same time no matter what came before.
  • Context-aware: expand the graph to remember what the previous step was. Now the time for the next step reflects what just happened—like knowing the oven is already warm, so baking starts faster.

Why context matters:

  • If the previous step warmed up the cache, the next step may run quicker. Ignoring that can lead you to pick a slower overall plan.

4) Main Findings (And Why They Matter)

On an Apple M1 chip (using NEON instructions), the authors tested many combinations and found:

  • The context-aware shortest-path method picked a plan that ran at about 29.8 GFLOPS. That’s:
    • About 5.2 times faster than doing everything the simplest way (pure radix-2).
    • About 34% faster than the best plan found when ignoring context.
  • The winning plan was: R4 → R2 → R4 → R4 → Fused-8. This means:
    • Start with a radix-4 pass,
    • then do a radix-2 pass,
    • then two more radix-4 passes,
    • then end with a fused block that keeps data in the fastest storage (registers) for several steps.
  • This plan is surprising. If you ignore context, you’d never choose that single radix-2 in the middle. But because the previous step “warms the cache,” that specific radix-2 at that moment is actually faster.
  • Fused blocks (doing several stages entirely in registers) are extremely powerful—keeping data “in your hands” is faster than moving it back and forth to memory.
  • Bigger is not always better. A large fused block that uses many registers (like FFT-32) looked promising but ended up slower due to “register pressure” (you run out of hands to hold everything).
  • What’s best depends on the chip. The best plan for Apple M1 isn’t the same as for Intel processors. The method adapts: measure again and run the search.

Why it matters:

  • Instead of guessing or using one-size-fits-all rules, you can measure and search to find the best plan for each device.
  • Context—what you just did—can change what you should do next. Modeling that can give a big speed boost.

5) What This Means for the Future

  • Faster apps: Since FFTs are used in music, images, radar, and more, making them faster speeds up many technologies.
  • Portable optimization: On any new chip, measure the step costs and re-run the search. You quickly get a new best plan without rewriting everything by hand.
  • Beyond FFTs: The same “shortest-path with context” idea can optimize other multi-step computations where memory and “what just happened” matter—like matrix operations or combining layers in a neural network.

In short: the paper shows a smart, practical way to turn “picking the fastest code” into a map-and-GPS problem—then proves that remembering recent history (context) helps you find a faster route.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of specific gaps and unresolved questions that future work could address to strengthen and extend the paper’s findings.

  • Scope of evaluation is narrow:
    • Only a single problem size is evaluated (N = 1024), with complex float32, split-complex layout, and DIF; it is unclear how results and optimal plans vary across sizes (small and very large N), precisions (float16/64), and algorithmic variants (DIT vs. DIF).
  • Single-architecture validation:
    • Results are reported only on an Apple M1 P-core (NEON, 128-bit); there is no empirical validation on other CPUs (e.g., AVX2/AVX-512, ARMv9 SVE, AMD Zen), mobile ARM cores, or GPUs to substantiate portability claims.
  • No comparison to state-of-the-art libraries:
    • The paper does not benchmark against FFTW, Apple vDSP/Accelerate, or VkFFT on the same hardware and problem size, leaving open whether the discovered plan is competitive in absolute terms.
  • First-order context assumption:
    • Context awareness conditions only on the immediately preceding operation type (k = 1); the paper does not quantify how much additional accuracy/performance is gained by k > 1 (e.g., k = 2) or identify when longer-range cache/prefetch correlations materially affect optimality.
  • Composability of measured edge weights:
    • The approach assumes that the sum of context-conditioned per-edge times predicts end-to-end runtime; the paper does not report prediction error (measured vs. predicted) or validate that additive composition holds when microarchitectural interactions extend beyond one step.
  • Context token granularity:
    • The “context” encodes only the predecessor edge type; it does not consider other salient state such as stride/working-set class, alignment, page coloring, TLB state, hardware prefetcher phase, or twiddle-cache residency that may further disambiguate costs.
  • Edge set completeness and generality:
    • Only R2/R4/R8 and three fused blocks (8/16/32) are explored; it remains unclear how to extend the edge set to:
      • Other radices (needed for non–power-of-two sizes),
      • Alternative fused shapes (e.g., 4, 24, 64) and partial fusions,
      • Stockham variants, transposition steps, or mixed in-place/out-of-place strategies.
  • Cache-blocking and tiling beyond last-stage fusions:
    • The framework focuses on fusing consecutive late stages in-register; it does not model higher-level cache tiling (e.g., L1/L2 blocking with transposes) or include transpose edges necessary for large-N and multidimensional FFTs.
  • Multicore and NUMA scaling:
    • The method is evaluated on a single core; there is no model for inter-thread cache contention, bandwidth saturation, NUMA placement, or synchronization edges, and no demonstration of how the graph/search generalizes to parallel FFTs.
  • Batched and streaming use cases:
    • Edge weights are measured for single transforms; the impact of batching, repeated plans on hot data, or streaming pipelines (where steady-state cache behavior differs) is not characterized.
  • Measurement methodology and robustness:
    • While medians over 50 trials and 3 runs are reported, the paper does not detail pinning, frequency throttling controls, cache-flush protocols, or confidence intervals; sensitivity to OS noise, DVFS, and thermal state is not quantified.
  • Planning-time scaling and reuse:
    • For real applications requiring many sizes N, precisions, and layouts, the cumulative measurement burden is unclear; strategies to share/reuse weights across sizes, interpolate/extrapolate costs, or amortize planning time are not explored.
  • Predictive models to reduce measurement:
    • There is no attempt to learn or regress edge weights from static features (e.g., register pressure, stride classes, twiddle locality) to reduce measurement count while preserving accuracy.
  • Numerical accuracy is not assessed:
    • Different arrangements and fusions (FMA usage, twiddle ordering) can alter floating-point error; the paper does not quantify accuracy trade-offs or provide bounds/ULP statistics across plans.
  • Data layout sensitivity:
    • Only split-complex storage is evaluated; the effect of interleaved complex layouts, SoA/AoS variations, and twiddle-table organizations on edge weights and optimal plans is not examined.
  • Code-size and I-cache effects:
    • Fused blocks may increase code size and pressure the instruction cache; the impact on performance and whether a multi-objective search (time vs. code size) changes the chosen plan is not investigated.
  • Prefetching and alignment strategies:
    • The framework does not explore explicit prefetch instructions, alignment/padding choices, or address-generation patterns as searchable edges, leaving potential gains unquantified.
  • Improving underperforming fused blocks:
    • FFT-32 underperforms due to register pressure; the paper does not investigate alternative scheduling (e.g., rematerializing twiddles, partial spilling, sub-blocking) that might make larger fusions viable on NEON or other ISAs.
  • Generalization to multidimensional FFTs:
    • Extending the graph to 2D/3D FFTs requires modeling transpose phases and interleaving of dimension-wise passes; the paper leaves the design of such edges, and the role of context across transposes, as open.
  • Edge-definition reproducibility:
    • The exact code-generation mechanism for each edge (radix/fused variant, stage-position specialization) and how correctness is verified are not fully specified, limiting reproducibility and adoption.
  • Sensitivity to twiddle-factor handling:
    • Only precomputed table loads are considered; the trade-off space between loading vs. on-the-fly twiddle generation (and its effect on registers/cache) is not included in the graph.
  • Porting to accelerators:
    • While claiming generalization, the approach is not demonstrated on GPUs/NPUs where kernel launch overheads, shared memory, and warp scheduling create different “context” dynamics; how to define and measure edges in those environments is open.

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now, mapped to sectors and accompanied by potential tools/workflows and key dependencies.

  • Software libraries (FFT ecosystems)
    • Use case: Integrate the context-aware shortest-path planner into FFT libraries (e.g., FFTW, VkFFT, Apple Accelerate/vDSP, ARM Ne10, FFTS) to select architecture-specific radix/fused-block sequences automatically.
    • Tools/workflows: Add a planner module that benchmarks conditional edge weights on install/first run and emits the optimal plan for each target CPU (NEON/AVX2/AVX-512).
    • Dependencies/assumptions: Availability of codelets for radix-2/4/8 and fused blocks; quick, reliable microbenchmark harness; first-order context (prev-op) sufficiently approximates cache effects for most devices; ability to ship or JIT architecture-specific microkernels.
  • Mobile/edge apps (daily life: audio/image/AR)
    • Use case: Deploy faster FFTs for audio processing (spectral effects, convolution reverb), image filtering, and AR pipelines on ARM-based mobile devices to reduce latency and battery usage.
    • Tools/workflows: Bundle a context-aware FFT planner in app DSP layers; calibrate on first app launch; cache the plan per device model.
    • Dependencies/assumptions: NEON-specific kernels; support for split-complex or equivalent layouts; predictable OS scheduling during calibration; small one-time tuning cost.
  • Communications/Networking (industry: baseband DSP)
    • Use case: Accelerate OFDM FFTs in LTE/5G/Wi‑Fi baseband stacks for lower latency and power draw in SDRs and embedded radios.
    • Tools/workflows: Firmware-integrated planner that runs offline (factory or boot-time) to pick plans per hardware unit; ship plan as a firmware configuration.
    • Dependencies/assumptions: Real-time constraints preclude tuning during operation; deterministic timing needed; fixed FFT sizes make planning a one-time cost.
  • Healthcare (MRI reconstruction, ultrasound beamforming)
    • Use case: Speed up FFT-heavy reconstruction pipelines on CPUs/SoCs used in scanners or edge devices for preliminary imaging.
    • Tools/workflows: Replace library FFTs in reconstruction toolchains with planner-optimized kernels; continuous integration (CI) step to calibrate per deployment hardware.
    • Dependencies/assumptions: Float32/float64 numerical behavior matches clinical tolerances; regulator-compliant validation; access to target hardware for calibration.
  • Finance (signal analysis for time series)
    • Use case: Faster spectral analysis in back-testing pipelines and live analytics on server CPUs (AVX2/AVX-512); improve throughput and reduce compute costs.
    • Tools/workflows: Integrate planner into internal DSP libraries; run on each server SKU to cache optimal plans; A/B test against baseline FFT.
    • Dependencies/assumptions: Availability of fused-block codelets for x86; stable CPU frequency/governor during tuning; cluster-wide plan caching by CPU model.
  • HPC/scientific computing (academia and industry)
    • Use case: Drop-in planner-led FFTs for spectral PDE solvers, Poisson solvers, CFD codes, and FFT-based convolutions to reduce time-to-solution.
    • Tools/workflows: Cluster-wide calibration job that profiles per node type; plans stored in module files or container images; transparent to application code.
    • Dependencies/assumptions: Scheduler allowances to run microbenchmarks on nodes; minimal jitter; reproducible performance across nodes of same SKU.
  • Game/media engines (audio DSP in Unreal/Unity)
    • Use case: Reduce latency for in-game audio effects and spectral processing on consoles/SoCs by adopting planner-optimized FFTs.
    • Tools/workflows: Update DSP plugins to call into a planner-backed FFT backend; precompute per-platform plans shipped with the game.
    • Dependencies/assumptions: Maintain plugin ABI; cross-platform fallbacks for platforms without NEON/AVX; QA validation across device SKUs.
  • Robotics (SLAM and perception pipelines)
    • Use case: Accelerate FFT-based correlation and filtering steps on ARM-based robots to improve real-time performance and power efficiency.
    • Tools/workflows: Integrate the planner into ROS/embedded DSP stacks; tune per-board at manufacturing or provisioning time.
    • Dependencies/assumptions: Pipeline actually uses FFT-based methods; deterministic timing constraints; modest tuning time budget.
  • Compilers/auto-scheduling (software tooling)
    • Use case: Add a “shortest-path kernel planner” pass to MLIR/LLVM/Halide/TVM for FFTs, treating fused blocks as first-class search choices.
    • Tools/workflows: A pass that calls a measurement runner to populate conditional edge weights and outputs codegen schedules.
    • Dependencies/assumptions: Integration effort; stable microbenchmarks; per-target codegen for fused blocks; acceptable build/tuning overhead.
  • Hardware performance characterization (QA/benchmarking)
    • Use case: Use the context-aware graph and microbenchmarks to profile cache and register effects on new CPUs/SoCs.
    • Tools/workflows: Internal lab harness that produces heatmaps of conditional costs for edge types and stages.
    • Dependencies/assumptions: Controlled environment (frequency, thermals); capability to pin cores and isolate noise.

Long-Term Applications

These opportunities require further research, scaling, or ecosystem development before routine deployment.

  • Generalization beyond FFTs (software, AI systems)
    • Use case: Apply context-aware shortest-path scheduling to other staged computations: DCT/DST, Winograd/overlap‑save convolution, multi-stage filters, sparse factorizations (QR/LU), and neural network layer fusion.
    • Tools/workflows: Build codelet libraries and microbench harnesses per domain; extend planners to new operator types with fused alternatives.
    • Dependencies/assumptions: Availability of high-quality codelets; larger search spaces; domain-specific correctness and numerical stability; need for shape/dtype awareness.
  • Higher-order context modeling (k > 1)
    • Use case: Capture longer-range cache/prefetch effects by conditioning edge weights on the last k operations (e.g., two-step context for prefetch pipelines).
    • Tools/workflows: Configurable planner with k-context; sampling strategies to limit measurement explosion; pruning heuristics.
    • Dependencies/assumptions: Combinatorial growth of nodes (|T|^k); need for beam-search or learning-guided pruning; risk of overfitting to microbenchmark conditions.
  • ML-guided edge-weight prediction (software tooling)
    • Use case: Train models to predict conditional edge costs, reducing the need for exhaustive measurements and enabling cross-device generalization.
    • Tools/workflows: Collect datasets across devices; feature engineering (stride, register pressure, cache line usage); hybrid measure+predict planners.
    • Dependencies/assumptions: Model robustness to OS jitter and thermal drift; maintainability across hardware generations; validation against ground-truth measurements.
  • GPU and heterogeneous scheduling (HPC, AI accelerators)
    • Use case: Extend context-aware search to GPUs (shared memory/register file tradeoffs) and CPU–GPU pipelines, choosing fused kernels and data layouts to exploit on-chip memory.
    • Tools/workflows: GPU codelets (e.g., CUDA/HIP/Metal) with tunable shared-memory tiles and fusion options; task-graph planners (e.g., for VkFFT/GFFT).
    • Dependencies/assumptions: Non-determinism and concurrency complicate measurements; larger thread/block scheduling spaces; portable measurement APIs.
  • Multi-size and multidimensional FFTs (HPC, imaging)
    • Use case: Support arbitrary radices, mixed-radix factors, and 2D/3D batched FFTs by expanding edge types (e.g., transpose, tiling, fusion across dimensions).
    • Tools/workflows: Codelet expansion for additional radices and tensor transposes; DAGs that include layout transformation edges.
    • Dependencies/assumptions: Increased planner complexity; memory bandwidth and transpose costs dominate—need multi-objective planning.
  • Energy-aware or multi-objective planning (mobile/edge, green computing)
    • Use case: Optimize for energy (Joules) or latency–energy Pareto fronts rather than time alone, to extend battery life or meet thermal envelopes.
    • Tools/workflows: Edge-weight measurement augmented with on-device power sensors; multi-objective shortest-path or weighted-sum planners.
    • Dependencies/assumptions: Accurate energy measurement APIs; variability due to DVFS; policy to select trade-offs at runtime.
  • Online/adaptive planners (systems/runtime)
    • Use case: Runtime selection and hot-swapping of plans based on device state (temperature, DVFS, co-runner interference) and input size distributions.
    • Tools/workflows: Low-overhead monitors; plan repositories; confidence-based switching; integration with JITs.
    • Dependencies/assumptions: Minimal perturbation from online measurement; safeguards against oscillations; predictable real-time behavior.
  • ISA/hardware co-design feedback (semiconductor industry)
    • Use case: Use planner insights to guide ISA and microarchitecture (e.g., register file size, cache line size, fused instructions) by simulating performance under different designs.
    • Tools/workflows: Pre-silicon simulators coupled with the planner to evaluate candidate designs; design-space exploration dashboards.
    • Dependencies/assumptions: Access to accurate simulators; alignment between microbench metrics and full-application behavior.
  • Standardization and policy (benchmarking and procurement)
    • Use case: Develop reproducible, context-aware microbenchmarking protocols for kernel selection used in academic publications and government/enterprise procurement.
    • Tools/workflows: Open benchmark suites and reporting guidelines capturing conditional costs and planning outcomes.
    • Dependencies/assumptions: Community consensus; careful handling of system variability; neutrality across vendors.
  • OTA plan distribution and device fleet management (platform vendors)
    • Use case: Ship per-SKU optimal plans via OS/library updates (e.g., platform FFT backends) to improve performance and energy across device fleets.
    • Tools/workflows: Cloud-based tuning farms; signed plan bundles; telemetry to validate impact.
    • Dependencies/assumptions: Secure update channels; device diversity management; regression testing at scale.

Glossary

  • AVX2: An x86 SIMD instruction set with 256-bit vectors; here noted for its 16 vector registers. "ARM NEON's 32 registers (vs. AVX2's 16) enable a fused block keeping 5 DIF passes in registers."
  • beam-width heuristic: A search strategy that keeps only the top k candidates at each step to limit exploration. "and addressed this with a beam-width heuristic."
  • butterfly: The basic operation in FFTs that combines inputs into outputs via add/subtract and twiddle multiplications. "The DIF butterfly computes top_out = top + bot and bot_out = (top - bot) · W for 4 parallel butterflies per NEON instruction."
  • cache residuals: Recently accessed cache lines that remain hot and can be reused by subsequent operations. "exploiting cache residuals that only exist in context."
  • cache warming: The effect where prior accesses pre-load data into cache, reducing the cost of subsequent operations. "so that edge weights capture inter-operation correlations such as cache warming: the cost of operation B depends on which operation A preceded it."
  • Cooley–Tukey algorithm: A classic FFT algorithm that recursively decomposes the transform into smaller radices. "the Cooley-Tukey algorithm (Cooley & Tukey, 1965) requires exactly L stages of butterfly computation"
  • codelet: A small, specialized FFT fragment used by planners for benchmarking and composition. "FFTW (Frigo & Johnson, 2005; 1998) addresses this by empirically benchmarking 'codelets' (small specialized FFT fragments) and combining the fastest ones."
  • context-aware edge weights: Edge costs measured conditional on the preceding operation to capture inter-operation effects. "we introduce context-aware edge weights: the graph's node space is expanded to encode the predecessor edge type"
  • context-aware model: A graph model where nodes encode both stage and the type of the preceding operation. "In the context-aware model, nodes are expanded to encode the predecessor edge type, so that edge weights capture inter-operation correlations"
  • context-free model: A graph model assuming edge weights are independent of preceding operations. "In the context-free model, nodes represent computation stages and edge weights are independently measured instruction costs."
  • DAG (directed acyclic graph): A graph with directed edges and no cycles; used here to model FFT stage transitions. "We define a weighted DAG G = (V, E, w)"
  • Decimation-in-Frequency (DIF): An FFT variant organizing computations by splitting in the frequency domain. "The DIF butterfly computes top_out = top + bot and bot_out = (top - bot) · W"
  • Dijkstra: A shortest-path algorithm used to find the fastest FFT arrangement in the graph. "Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1 GFLOPS (74% of optimal)."
  • FFTW: A widely used FFT library that plans transforms by benchmarking and composing codelets. "FFTW (Frigo & Johnson, 2005; 1998) addresses this by empirically benchmarking 'codelets' (small specialized FFT fragments) and combining the fastest ones."
  • FMA (fused multiply-add): An instruction that performs a multiply and add in a single operation. "Single Apple M1 P-core (Firestorm, 3.2 GHz, 128-bit NEON, 2 FMA units)."
  • fused register block: A sequence of FFT passes executed entirely in registers before storing back to memory. "Fused blocks load B points into SIMD registers, compute log2(B) passes entirely in-register, then store."
  • GFLOPS: Billions of floating-point operations per second; a performance metric. "Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1 GFLOPS (74% of optimal)."
  • hardware prefetch state: The internal state of the hardware prefetcher that affects memory access performance. "the execution time depends on complex interactions between instruction scheduling, cache hierarchy behavior, register pressure, and hardware prefetch state."
  • in-register: Computation performed entirely within CPU registers without intermediate memory traffic. "In-register; zero memory traffic"
  • Markov property: The assumption that the next cost depends only on the current state (e.g., the immediate predecessor), not on earlier history. "it directly models the cache correlation as a first-order Markov property in the search graph."
  • NEON: ARM’s SIMD instruction set extension used on Apple M1 for vectorized computation. "Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement"
  • optimal substructure: A property where optimal solutions to subproblems compose into an optimal overall solution. "FFTW's dynamic programming assumes optimal substructure: the best codelet for a sub-problem remains best regardless of context."
  • radix-2 pass: An FFT stage computing butterflies of size 2, advancing by one stage. "Radix-2 pass"
  • radix-4 pass: An FFT stage computing butterflies of size 4, advancing by two stages. "Radix-4 pass"
  • radix-8 pass: An FFT stage computing butterflies of size 8, advancing by three stages. "Radix-8 pass"
  • register blocking: A strategy that schedules computations to keep data in registers across steps to reduce memory traffic. "differ in radix choice, stage ordering, and register-blocking strategy."
  • register file: The set of hardware registers available for vector operations. "The FFT-32 block uses 16 of NEON's 32 registers and would not fit in AVX2's 16-register file."
  • register pressure: The demand for registers exceeding their availability, causing spills and performance loss. "due to register pressure, a tradeoff discovered automatically."
  • ruletree: SPIRAL’s internal representation of transform derivations as a tree of rewrite rules. "the performance of a ruletree varies greatly depending on its position in a larger ruletree"
  • SIMD: Single Instruction, Multiple Data; parallel execution over vector lanes. "These alternatives use different SIMD instruction mixes with different latencies"
  • SPIRAL: A system for automatic code generation of DSP transforms. "SPIRAL (Püschel et al., 2005) similarly noted that 'the performance of a ruletree varies greatly depending on its position in a larger ruletree'"
  • split-complex format: A memory layout storing real and imaginary parts in separate arrays. "Split-complex format (separate Re/Im arrays) enables unit-stride vld1q_f32 loads."
  • state-space expansion: Augmenting the nodes of a search graph to include additional state (e.g., predecessor type) so costs can be conditioned on context. "This is a standard state-space expansion technique from operations research, applied here for the first time to FFT cache correlations."
  • stride: The memory distance between consecutive elements accessed in a pass. "the cost of a radix-4 pass at stage 4 depends on whether the preceding operation left stride-128 or stride-64 data in L1."
  • twiddle factor: Complex roots-of-unity multipliers used in FFT butterflies. "causing twiddle-factor spills that negate the saved memory traffic."
  • unit-stride: Accessing consecutive memory locations with stride 1. "Split-complex format (separate Re/Im arrays) enables unit-stride vld1q_f32 loads."
  • vld1q_f32: An ARM NEON intrinsic that loads a 128-bit vector of four 32-bit floats from memory. "Split-complex format (separate Re/Im arrays) enables unit-stride vld1q_f32 loads."
