Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search
Abstract: An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the \emph{context-free} model, nodes represent computation stages and edge weights are independently measured instruction costs. In the \emph{context-aware} model, nodes are expanded to encode the \emph{predecessor edge type}, so that edge weights capture inter-operation correlations such as cache warming -- the cost of operation~B depends on which operation~A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW \citep{FrigoJohnson1998}: that optimal-substructure assumptions break down ``because of the different states of the cache.'' Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1~GFLOPS (74\% of optimal). The context-aware Dijkstra discovers $\text{R4} \to \text{R2} \to \text{R4} \to \text{R4} \to \text{Fused-8}$ at 29.8~GFLOPS -- a $5.2\times$ improvement over pure radix-2 and 34\% faster than the context-free result. This arrangement includes a radix-2 pass \emph{sandwiched between} radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.
Explain it Like I'm 14
Shortest-Path FFT: What This Paper Is About (In Simple Terms)
Big Picture
This paper is about making a common math tool, the Fast Fourier Transform (FFT), run as fast as possible on modern computer chips. An FFT is used in lots of tech—like music apps, image processing, and science—to turn signals from “time” into “frequency.” The trick is: there are many different ways to program an FFT that all give the same answer, but some ways are much faster than others depending on the computer.
The authors show a new, simple idea: pick the fastest FFT the same way a GPS finds the quickest route—by turning choices into a map and running a shortest-path search.
1) Overview: The Main Topic
An FFT can be built from different “building blocks” and in different orders. Each choice uses different computer instructions and memory patterns. The paper turns all these choices into a graph (like a map), measures how long each choice takes, and then uses a shortest-path algorithm (Dijkstra’s algorithm—like a GPS for graphs) to find the fastest overall plan. They also show that to truly get the best result, you must consider what just happened before each step because that changes how fast the next step runs.
2) What Questions Are They Trying to Answer?
- How can we automatically choose the fastest way to run an FFT on a specific chip?
- If we think of FFT choices as a “map,” can a shortest-path search find the fastest plan?
- Do we need to consider “context”—what the previous step did—to predict the next step’s speed?
- Can this method beat common rules of thumb, like “always use bigger steps” or “use the same best building block everywhere”?
3) How They Did It (With Easy Analogies)
Think of cooking a recipe:
- You need to do several stages. Each stage can be done in a few different ways (like chopping with a knife or using a food processor).
- Even if two methods produce the same result, one might be faster depending on what you just did (e.g., the cutting board is already out and clean).
Now translate that to computers:
- The FFT has “stages” you must do in order.
- Each stage can be done with different “radixes” (radix-2, radix-4, radix-8) or by “fusing” several stages so the data stays in super-fast storage.
- “Registers” are like the chef’s hands—very fast, but you can’t hold much at once.
- The “cache” is like a nearby shelf; it’s faster than the pantry (main memory). If you just used something, it might still be warm and ready (a “warm” cache), making the next step faster. That’s called a cache effect.
Their approach:
- Build a graph where each node is “how many FFT stages are done” and each edge is “a way to do the next 1–5 stages” (like choosing knife vs. food processor).
- Measure how long each edge takes on the real computer (don’t guess).
- Run Dijkstra’s algorithm (a shortest-path search) to find the fastest sequence from start to finish.
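To make the approach concrete, here is a minimal context-free version of the search in Python. The edge set, stage counts, and all costs below are made-up illustrative numbers (not the paper's measurements), and the function name is ours:

```python
import heapq

# Candidate passes: how many FFT stages each one advances. A fused 8-point
# block covers 3 stages entirely in registers. (Edge set simplified; the
# paper also considers larger fused blocks.)
ADVANCE = {"R2": 1, "R4": 2, "R8": 3, "Fused8": 3}

# Hypothetical context-free cost of each pass (arbitrary time units,
# NOT the paper's measured numbers).
COST = {"R2": 120.0, "R4": 150.0, "R8": 260.0, "Fused8": 140.0}

def context_free_plan(total_stages=10):
    """Dijkstra over nodes = 'stages completed so far'."""
    dist, prev = {0: 0.0}, {}
    pq = [(0.0, 0)]
    while pq:
        d, s = heapq.heappop(pq)
        if s == total_stages:
            break
        if d > dist.get(s, float("inf")):
            continue  # stale heap entry
        for op, adv in ADVANCE.items():
            t = s + adv
            if t > total_stages:
                continue
            nd = d + COST[op]
            if nd < dist.get(t, float("inf")):
                dist[t], prev[t] = nd, (s, op)
                heapq.heappush(pq, (nd, t))
    # Walk predecessors back from the goal to recover the plan.
    plan, s = [], total_stages
    while s != 0:
        s, op = prev[s]
        plan.append(op)
    return list(reversed(plan)), dist[total_stages]
```

On these toy numbers the search picks three fused blocks plus one radix-2 pass; swapping in real measured costs per target CPU is what makes the result hardware-specific.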
Context-free vs. context-aware:
- Context-free: assume each step takes the same time no matter what came before.
- Context-aware: expand the graph to remember what the previous step was. Now the time for the next step reflects what just happened—like knowing the oven is already warm, so baking starts faster.
Why context matters:
- If the previous step warmed up the cache, the next step may run quicker. Ignoring that can lead you to pick a slower overall plan.
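A sketch of the context-aware expansion, again with invented conditional costs, deliberately chosen so that a radix-2 pass is cheap only when it immediately follows a radix-4 pass (the "warm oven" effect). None of these numbers come from the paper:

```python
import heapq
from itertools import count

ADVANCE = {"R2": 1, "R4": 2, "Fused8": 3}

# Hypothetical conditional costs: time of `op` given the previous op
# (None = cold start). Invented so that R2 is only cheap right after R4.
COST = {
    (None, "R2"): 160.0,     (None, "R4"): 140.0,     (None, "Fused8"): 200.0,
    ("R2", "R2"): 100.0,     ("R2", "R4"): 130.0,     ("R2", "Fused8"): 200.0,
    ("R4", "R2"): 40.0,      ("R4", "R4"): 140.0,     ("R4", "Fused8"): 170.0,
    ("Fused8", "R2"): 100.0, ("Fused8", "R4"): 145.0, ("Fused8", "Fused8"): 190.0,
}

def context_aware_plan(total_stages=10):
    """Dijkstra over expanded nodes (stages_done, previous_op)."""
    start = (0, None)
    dist, prev = {start: 0.0}, {}
    tie = count()  # tiebreaker so the heap never compares node tuples
    pq = [(0.0, next(tie), start)]
    goal = None
    while pq:
        d, _, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        s, p = node
        if s == total_stages:
            goal = node
            break
        for op, adv in ADVANCE.items():
            if s + adv > total_stages:
                continue
            nxt = (s + adv, op)
            nd = d + COST[(p, op)]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, (node, op)
                heapq.heappush(pq, (nd, next(tie), nxt))
    plan, node = [], goal
    while node != start:
        node, op = prev[node]
        plan.append(op)
    return list(reversed(plan)), dist[goal]
```

With these costs the expanded search returns a plan in which a radix-2 pass sits sandwiched between radix-4 passes, precisely because it is cheap only in that context; this mirrors the paper's qualitative finding rather than reproducing its numbers.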
4) Main Findings (And Why They Matter)
On an Apple M1 chip (using NEON instructions), the authors tested many combinations and found:
- The context-aware shortest-path method picked a plan that ran at about 29.8 GFLOPS. That’s:
- About 5.2 times faster than doing everything the simplest way (pure radix-2).
- About 34% faster than the best plan found when ignoring context.
- The winning plan was: R4 → R2 → R4 → R4 → Fused-8. This means:
- Start with a radix-4 pass,
- then do a radix-2 pass,
- then two more radix-4 passes,
- then end with a fused block that keeps data in the fastest storage (registers) for several steps.
- This plan is surprising. If you ignore context, you’d never choose that single radix-2 in the middle. But because the previous step “warms the cache,” that specific radix-2 at that moment is actually faster.
- Fused blocks (doing several stages entirely in registers) are extremely powerful—keeping data “in your hands” is faster than moving it back and forth to memory.
- Bigger is not always better. A large fused block that uses many registers (like FFT-32) looked promising but ended up slower due to “register pressure” (you run out of hands to hold everything).
- What’s best depends on the chip. The best plan for Apple M1 isn’t the same as for Intel processors. The method adapts: measure again and run the search.
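As a quick arithmetic check on the winning plan, each pass type advances a fixed number of stages (radix-2: one, radix-4: two, a fused 8-point block: three), so the plan's coverage can be tallied in a few lines:

```python
# Stages advanced by each pass type: a radix-2^k pass advances k stages,
# and the fused 8-point block covers 3 stages in registers.
STAGES = {"R4": 2, "R2": 1, "Fused8": 3}

plan = ["R4", "R2", "R4", "R4", "Fused8"]   # the arrangement from the paper
total = sum(STAGES[p] for p in plan)
print(total)  # 10
```

The five passes cover 10 butterfly stages in total, consistent with a transform of $2^{10}$ points.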
Why it matters:
- Instead of guessing or using one-size-fits-all rules, you can measure and search to find the best plan for each device.
- Context—what you just did—can change what you should do next. Modeling that can give a big speed boost.
5) What This Means for the Future
- Faster apps: Since FFTs are used in music, images, radar, and more, making them faster speeds up many technologies.
- Portable optimization: On any new chip, measure the step costs and re-run the search. You quickly get a new best plan without rewriting everything by hand.
- Beyond FFTs: The same “shortest-path with context” idea can optimize other multi-step computations where memory and “what just happened” matter—like matrix operations or combining layers in a neural network.
In short: the paper shows a smart, practical way to turn “picking the fastest code” into a map-and-GPS problem—then proves that remembering recent history (context) helps you find a faster route.
Knowledge Gaps, Limitations, and Open Questions
Below is a consolidated list of specific gaps and unresolved questions that future work could address to strengthen and extend the paper’s findings.
- Scope of evaluation is narrow:
- Only a single problem size is evaluated, with complex float32, split-complex layout, and DIF; it is unclear how results and optimal plans vary across sizes (small and very large $N$), precisions (float16/64), and algorithmic variants (DIT vs. DIF).
- Single-architecture validation:
- Results are reported only on an Apple M1 P-core (NEON, 128-bit); there is no empirical validation on other CPUs (e.g., AVX2/AVX-512, ARMv9 SVE, AMD Zen), mobile ARM cores, or GPUs to substantiate portability claims.
- No comparison to state-of-the-art libraries:
- The paper does not benchmark against FFTW, Apple vDSP/Accelerate, or VkFFT on the same hardware and problem size, leaving open whether the discovered plan is competitive in absolute terms.
- First-order context assumption:
- Context awareness conditions only on the immediately preceding operation type (a first-order context, $k=1$); the paper does not quantify how much additional accuracy/performance is gained by longer contexts (e.g., $k=2$) or identify when longer-range cache/prefetch correlations materially affect optimality.
- Composability of measured edge weights:
- The approach assumes that the sum of context-conditioned per-edge times predicts end-to-end runtime; the paper does not report prediction error (measured vs. predicted) or validate that additive composition holds when microarchitectural interactions extend beyond one step.
- Context token granularity:
- The “context” encodes only the predecessor edge type; it does not consider other salient state such as stride/working-set class, alignment, page coloring, TLB state, hardware prefetcher phase, or twiddle-cache residency that may further disambiguate costs.
- Edge set completeness and generality:
- Only R2/R4/R8 and three fused blocks (8/16/32) are explored; it remains unclear how to extend the edge set to:
- Other radices (needed for non–power-of-two sizes),
- Alternative fused shapes (e.g., 4, 24, 64) and partial fusions,
- Stockham variants, transposition steps, or mixed in-place/out-of-place strategies.
- Cache-blocking and tiling beyond last-stage fusions:
- The framework focuses on fusing consecutive late stages in-register; it does not model higher-level cache tiling (e.g., L1/L2 blocking with transposes) or include the transpose edges necessary for large-$N$ and multidimensional FFTs.
- Multicore and NUMA scaling:
- The method is evaluated on a single core; there is no model for inter-thread cache contention, bandwidth saturation, NUMA placement, or synchronization edges, and no demonstration of how the graph/search generalizes to parallel FFTs.
- Batched and streaming use cases:
- Edge weights are measured for single transforms; the impact of batching, repeated plans on hot data, or streaming pipelines (where steady-state cache behavior differs) is not characterized.
- Measurement methodology and robustness:
- While medians over 50 trials and 3 runs are reported, the paper does not detail pinning, frequency throttling controls, cache-flush protocols, or confidence intervals; sensitivity to OS noise, DVFS, and thermal state is not quantified.
- Planning-time scaling and reuse:
- For real applications requiring many sizes, precisions, and layouts, the cumulative measurement burden is unclear; strategies to share/reuse weights across sizes, interpolate/extrapolate costs, or amortize planning time are not explored.
- Predictive models to reduce measurement:
- There is no attempt to learn or regress edge weights from static features (e.g., register pressure, stride classes, twiddle locality) to reduce measurement count while preserving accuracy.
- Numerical accuracy is not assessed:
- Different arrangements and fusions (FMA usage, twiddle ordering) can alter floating-point error; the paper does not quantify accuracy trade-offs or provide bounds/ULP statistics across plans.
- Data layout sensitivity:
- Only split-complex storage is evaluated; the effect of interleaved complex layouts, SoA/AoS variations, and twiddle-table organizations on edge weights and optimal plans is not examined.
- Code-size and I-cache effects:
- Fused blocks may increase code size and pressure the instruction cache; the impact on performance and whether a multi-objective search (time vs. code size) changes the chosen plan is not investigated.
- Prefetching and alignment strategies:
- The framework does not explore explicit prefetch instructions, alignment/padding choices, or address-generation patterns as searchable edges, leaving potential gains unquantified.
- Improving underperforming fused blocks:
- FFT-32 underperforms due to register pressure; the paper does not investigate alternative scheduling (e.g., rematerializing twiddles, partial spilling, sub-blocking) that might make larger fusions viable on NEON or other ISAs.
- Generalization to multidimensional FFTs:
- Extending the graph to 2D/3D FFTs requires modeling transpose phases and interleaving of dimension-wise passes; the paper leaves the design of such edges, and the role of context across transposes, as open.
- Edge-definition reproducibility:
- The exact code-generation mechanism for each edge (radix/fused variant, stage-position specialization) and how correctness is verified are not fully specified, limiting reproducibility and adoption.
- Sensitivity to twiddle-factor handling:
- Only precomputed table loads are considered; the trade-off space between loading vs. on-the-fly twiddle generation (and its effect on registers/cache) is not included in the graph.
- Porting to accelerators:
- While claiming generalization, the approach is not demonstrated on GPUs/NPUs where kernel launch overheads, shared memory, and warp scheduling create different “context” dynamics; how to define and measure edges in those environments is open.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed now, mapped to sectors and accompanied by potential tools/workflows and key dependencies.
- Software libraries (FFT ecosystems)
- Use case: Integrate the context-aware shortest-path planner into FFT libraries (e.g., FFTW, VkFFT, Apple Accelerate/vDSP, ARM Ne10, FFTS) to select architecture-specific radix/fused-block sequences automatically.
- Tools/workflows: Add a planner module that benchmarks conditional edge weights on install/first run and emits the optimal plan for each target CPU (NEON/AVX2/AVX-512).
- Dependencies/assumptions: Availability of codelets for radix-2/4/8 and fused blocks; quick, reliable microbenchmark harness; first-order context (prev-op) sufficiently approximates cache effects for most devices; ability to ship or JIT architecture-specific microkernels.
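One possible shape for such a planner module, using stand-in kernels and a hypothetical JSON cache keyed by CPU architecture (all function names and the cache layout here are illustrative, not from any existing library):

```python
import json
import pathlib
import platform
import timeit

def measure(kernel, repeats=50):
    """Median wall-clock time of `kernel` over `repeats` runs, in seconds."""
    times = sorted(timeit.timeit(kernel, number=1) for _ in range(repeats))
    return times[len(times) // 2]

def calibrate_and_cache(kernels, cache_dir=pathlib.Path(".plan_cache")):
    """Benchmark each candidate kernel once per machine and persist the
    measured edge weights keyed by CPU architecture. A real planner would
    feed these weights into the shortest-path search before code selection."""
    cache_dir.mkdir(exist_ok=True)
    cache_file = cache_dir / f"{platform.machine()}.json"
    if cache_file.exists():                 # reuse an earlier calibration
        return json.loads(cache_file.read_text())
    weights = {name: measure(k) for name, k in kernels.items()}
    cache_file.write_text(json.dumps(weights))
    return weights
```

The first call on a device pays the microbenchmark cost; subsequent calls (and app launches) read the cached weights, matching the "calibrate on first run, cache the plan per device" workflow described above.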
- Mobile/edge apps (daily life: audio/image/AR)
- Use case: Deploy faster FFTs for audio processing (spectral effects, convolution reverb), image filtering, and AR pipelines on ARM-based mobile devices to reduce latency and battery usage.
- Tools/workflows: Bundle a context-aware FFT planner in app DSP layers; calibrate on first app launch; cache the plan per device model.
- Dependencies/assumptions: NEON-specific kernels; support for split-complex or equivalent layouts; predictable OS scheduling during calibration; small one-time tuning cost.
- Communications/Networking (industry: baseband DSP)
- Use case: Accelerate OFDM FFTs in LTE/5G/Wi‑Fi baseband stacks for lower latency and power draw in SDRs and embedded radios.
- Tools/workflows: Firmware-integrated planner that runs offline (factory or boot-time) to pick plans per hardware unit; ship plan as a firmware configuration.
- Dependencies/assumptions: Real-time constraints preclude tuning during operation; deterministic timing needed; fixed FFT sizes make planning a one-time cost.
- Healthcare (MRI reconstruction, ultrasound beamforming)
- Use case: Speed up FFT-heavy reconstruction pipelines on CPUs/SoCs used in scanners or edge devices for preliminary imaging.
- Tools/workflows: Replace library FFTs in reconstruction toolchains with planner-optimized kernels; continuous integration (CI) step to calibrate per deployment hardware.
- Dependencies/assumptions: Float32/float64 numerical behavior matches clinical tolerances; regulator-compliant validation; access to target hardware for calibration.
- Finance (signal analysis for time series)
- Use case: Faster spectral analysis in back-testing pipelines and live analytics on server CPUs (AVX2/AVX-512); improve throughput and reduce compute costs.
- Tools/workflows: Integrate planner into internal DSP libraries; run on each server SKU to cache optimal plans; A/B test against baseline FFT.
- Dependencies/assumptions: Availability of fused-block codelets for x86; stable CPU frequency/governor during tuning; cluster-wide plan caching by CPU model.
- HPC/scientific computing (academia and industry)
- Use case: Drop-in planner-led FFTs for spectral PDE solvers, Poisson solvers, CFD codes, and FFT-based convolutions to reduce time-to-solution.
- Tools/workflows: Cluster-wide calibration job that profiles per node type; plans stored in module files or container images; transparent to application code.
- Dependencies/assumptions: Scheduler allowances to run microbenchmarks on nodes; minimal jitter; reproducible performance across nodes of same SKU.
- Game/media engines (audio DSP in Unreal/Unity)
- Use case: Reduce latency for in-game audio effects and spectral processing on consoles/SoCs by adopting planner-optimized FFTs.
- Tools/workflows: Update DSP plugins to call into a planner-backed FFT backend; precompute per-platform plans shipped with the game.
- Dependencies/assumptions: Maintain plugin ABI; cross-platform fallbacks for platforms without NEON/AVX; QA validation across device SKUs.
- Robotics (SLAM and perception pipelines)
- Use case: Accelerate FFT-based correlation and filtering steps on ARM-based robots to improve real-time performance and power efficiency.
- Tools/workflows: Integrate the planner into ROS/embedded DSP stacks; tune per-board at manufacturing or provisioning time.
- Dependencies/assumptions: Pipeline actually uses FFT-based methods; deterministic timing constraints; modest tuning time budget.
- Compilers/auto-scheduling (software tooling)
- Use case: Add a “shortest-path kernel planner” pass to MLIR/LLVM/Halide/TVM for FFTs, treating fused blocks as first-class search choices.
- Tools/workflows: A pass that calls a measurement runner to populate conditional edge weights and outputs codegen schedules.
- Dependencies/assumptions: Integration effort; stable microbenchmarks; per-target codegen for fused blocks; acceptable build/tuning overhead.
- Hardware performance characterization (QA/benchmarking)
- Use case: Use the context-aware graph and microbenchmarks to profile cache and register effects on new CPUs/SoCs.
- Tools/workflows: Internal lab harness that produces heatmaps of conditional costs for edge types and stages.
- Dependencies/assumptions: Controlled environment (frequency, thermals); capability to pin cores and isolate noise.
Long-Term Applications
These opportunities require further research, scaling, or ecosystem development before routine deployment.
- Generalization beyond FFTs (software, AI systems)
- Use case: Apply context-aware shortest-path scheduling to other staged computations: DCT/DST, Winograd/overlap‑save convolution, multi-stage filters, sparse factorizations (QR/LU), and neural network layer fusion.
- Tools/workflows: Build codelet libraries and microbench harnesses per domain; extend planners to new operator types with fused alternatives.
- Dependencies/assumptions: Availability of high-quality codelets; larger search spaces; domain-specific correctness and numerical stability; need for shape/dtype awareness.
- Higher-order context modeling (k > 1)
- Use case: Capture longer-range cache/prefetch effects by conditioning edge weights on the last k operations (e.g., two-step context for prefetch pipelines).
- Tools/workflows: Configurable planner with k-context; sampling strategies to limit measurement explosion; pruning heuristics.
- Dependencies/assumptions: Combinatorial growth of the node space (on the order of $|T|^k$ for context length $k$); need for beam search or learning-guided pruning; risk of overfitting to microbenchmark conditions.
- ML-guided edge-weight prediction (software tooling)
- Use case: Train models to predict conditional edge costs, reducing the need for exhaustive measurements and enabling cross-device generalization.
- Tools/workflows: Collect datasets across devices; feature engineering (stride, register pressure, cache line usage); hybrid measure+predict planners.
- Dependencies/assumptions: Model robustness to OS jitter and thermal drift; maintainability across hardware generations; validation against ground-truth measurements.
- GPU and heterogeneous scheduling (HPC, AI accelerators)
- Use case: Extend context-aware search to GPUs (shared memory/register file tradeoffs) and CPU–GPU pipelines, choosing fused kernels and data layouts to exploit on-chip memory.
- Tools/workflows: GPU codelets (e.g., CUDA/HIP/Metal) with tunable shared-memory tiles and fusion options; task-graph planners (e.g., for VkFFT/GFFT).
- Dependencies/assumptions: Non-determinism and concurrency complicate measurements; larger thread/block scheduling spaces; portable measurement APIs.
- Multi-size and multidimensional FFTs (HPC, imaging)
- Use case: Support arbitrary radices, mixed-radix factors, and 2D/3D batched FFTs by expanding edge types (e.g., transpose, tiling, fusion across dimensions).
- Tools/workflows: Codelet expansion for additional radices and tensor transposes; DAGs that include layout transformation edges.
- Dependencies/assumptions: Increased planner complexity; memory bandwidth and transpose costs dominate—need multi-objective planning.
- Energy-aware or multi-objective planning (mobile/edge, green computing)
- Use case: Optimize for energy (Joules) or latency–energy Pareto fronts rather than time alone, to extend battery life or meet thermal envelopes.
- Tools/workflows: Edge-weight measurement augmented with on-device power sensors; multi-objective shortest-path or weighted-sum planners.
- Dependencies/assumptions: Accurate energy measurement APIs; variability due to DVFS; policy to select trade-offs at runtime.
- Online/adaptive planners (systems/runtime)
- Use case: Runtime selection and hot-swapping of plans based on device state (temperature, DVFS, co-runner interference) and input size distributions.
- Tools/workflows: Low-overhead monitors; plan repositories; confidence-based switching; integration with JITs.
- Dependencies/assumptions: Minimal perturbation from online measurement; safeguards against oscillations; predictable real-time behavior.
- ISA/hardware co-design feedback (semiconductor industry)
- Use case: Use planner insights to guide ISA and microarchitecture (e.g., register file size, cache line size, fused instructions) by simulating performance under different designs.
- Tools/workflows: Pre-silicon simulators coupled with the planner to evaluate candidate designs; design-space exploration dashboards.
- Dependencies/assumptions: Access to accurate simulators; alignment between microbench metrics and full-application behavior.
- Standardization and policy (benchmarking and procurement)
- Use case: Develop reproducible, context-aware microbenchmarking protocols for kernel selection used in academic publications and government/enterprise procurement.
- Tools/workflows: Open benchmark suites and reporting guidelines capturing conditional costs and planning outcomes.
- Dependencies/assumptions: Community consensus; careful handling of system variability; neutrality across vendors.
- OTA plan distribution and device fleet management (platform vendors)
- Use case: Ship per-SKU optimal plans via OS/library updates (e.g., platform FFT backends) to improve performance and energy across device fleets.
- Tools/workflows: Cloud-based tuning farms; signed plan bundles; telemetry to validate impact.
- Dependencies/assumptions: Secure update channels; device diversity management; regression testing at scale.
Glossary
- AVX2: An x86 SIMD instruction set with 256-bit vectors; here noted for its 16 vector registers. "ARM NEON's 32 registers (vs.\ AVX2's 16) enable a fused block keeping 5 DIF passes in registers."
- beam-width heuristic: A search strategy that keeps only the top k candidates at each step to limit exploration. "and addressed this with a beam-width heuristic."
- butterfly: The basic operation in FFTs that combines inputs into outputs via add/subtract and twiddle multiplications. "The DIF butterfly computes $a+b$ and $(a-b)\omega$ for 4 parallel butterflies per NEON instruction."
- cache residuals: Recently accessed cache lines that remain hot and can be reused by subsequent operations. "exploiting cache residuals that only exist in context."
- cache warming: The effect where prior accesses pre-load data into cache, reducing the cost of subsequent operations. "so that edge weights capture inter-operation correlations such as cache warming---the cost of operation~B depends on which operation~A preceded it."
- Cooley–Tukey algorithm: A classic FFT algorithm that recursively decomposes the transform into smaller radices. "the Cooley-Tukey algorithm \citep{CooleyTukey1965} requires exactly $\log_2 N$ stages of butterfly computation"
- codelet: A small, specialized FFT fragment used by planners for benchmarking and composition. "FFTW \citep{FrigoJohnson2005,FrigoJohnson1998} addresses this by empirically benchmarking ``codelets'' (small specialized FFT fragments) and combining the fastest ones."
- context-aware edge weights: Edge costs measured conditional on the preceding operation to capture inter-operation effects. "we introduce context-aware edge weights: the graph's node space is expanded to encode the predecessor edge type"
- context-aware model: A graph model where nodes encode both stage and the type of the preceding operation. "In the context-aware model, nodes are expanded to encode the predecessor edge type, so that edge weights capture inter-operation correlations"
- context-free model: A graph model assuming edge weights are independent of preceding operations. "In the context-free model, nodes represent computation stages and edge weights are independently measured instruction costs."
- DAG (directed acyclic graph): A graph with directed edges and no cycles; used here to model FFT stage transitions. "We define a weighted DAG"
- Decimation-in-Frequency (DIF): An FFT variant organizing computations by splitting in the frequency domain. "The DIF butterfly computes $a+b$ and $(a-b)\omega$"
- Dijkstra: A shortest-path algorithm used to find the fastest FFT arrangement in the graph. "Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1~GFLOPS (74\% of optimal)."
- FFTW: A widely used FFT library that plans transforms by benchmarking and composing codelets. "FFTW \citep{FrigoJohnson2005,FrigoJohnson1998} addresses this by empirically benchmarking ``codelets'' (small specialized FFT fragments) and combining the fastest ones."
- FMA (fused multiply-add): An instruction that performs a multiply and add in a single operation. "Single Apple M1 P-core (Firestorm, 3.2~GHz, 128-bit NEON, 2 FMA units)."
- fused register block: A sequence of FFT passes executed entirely in registers before storing back to memory. "Fused blocks load points into SIMD registers, compute passes entirely in-register, then store."
- GFLOPS: Billions of floating-point operations per second; a performance metric. "Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1~GFLOPS (74\% of optimal)."
- hardware prefetch state: The internal state of the hardware prefetcher that affects memory access performance. "the execution time depends on complex interactions between instruction scheduling, cache hierarchy behavior, register pressure, and hardware prefetch state."
- in-register: Computation performed entirely within CPU registers without intermediate memory traffic. "In-register; zero memory traffic"
- Markov property: The assumption that the next cost depends only on the current state (e.g., the immediate predecessor), not on earlier history. "it directly models the cache correlation as a first-order Markov property in the search graph."
- NEON: ARM’s SIMD instruction set extension used on Apple M1 for vectorized computation. "Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement"
- optimal substructure: A property where optimal solutions to subproblems compose into an optimal overall solution. "FFTW's dynamic programming assumes optimal substructure: the best codelet for a sub-problem remains best regardless of context."
- radix-2 pass: An FFT stage computing butterflies of size 2, advancing by one stage. "Radix-2 pass"
- radix-4 pass: An FFT stage computing butterflies of size 4, advancing by two stages. "Radix-4 pass"
- radix-8 pass: An FFT stage computing butterflies of size 8, advancing by three stages. "Radix-8 pass"
- register blocking: A strategy that schedules computations to keep data in registers across steps to reduce memory traffic. "differ in radix choice, stage ordering, and register-blocking strategy."
- register file: The set of hardware registers available for vector operations. "The FFT-32 block uses 16 of NEON's 32 registers and would not fit in AVX2's 16-register file."
- register pressure: The demand for registers exceeding their availability, causing spills and performance loss. "due to register pressure---a tradeoff discovered automatically."
- ruletree: SPIRAL’s internal representation of transform derivations as a tree of rewrite rules. "the performance of a ruletree varies greatly depending on its position in a larger ruletree"
- SIMD: Single Instruction, Multiple Data; parallel execution over vector lanes. "These alternatives use different SIMD instruction mixes with different latencies"
- SPIRAL: A system for automatic code generation of DSP transforms. "SPIRAL \citep{SPIRAL2005} similarly noted that ``the performance of a ruletree varies greatly depending on its position in a larger ruletree''"
- split-complex format: A memory layout storing real and imaginary parts in separate arrays. "Split-complex format (separate Re/Im arrays) enables unit-stride vld1q_f32 loads."
- state-space expansion: Augmenting the nodes of a search graph to include additional state (e.g., predecessor type) so costs can be conditioned on context. "This is a standard state-space expansion technique from operations research, applied here for the first time to FFT cache correlations."
- stride: The memory distance between consecutive elements accessed in a pass. "the cost of a radix-4 pass at stage~4 depends on whether the preceding operation left stride-128 or stride-64 data in L1."
- twiddle factor: Complex roots-of-unity multipliers used in FFT butterflies. "causing twiddle-factor spills that negate the saved memory traffic."
- unit-stride: Accessing consecutive memory locations with stride 1. "Split-complex format (separate Re/Im arrays) enables unit-stride vld1q_f32 loads."
- vld1q_f32: An ARM NEON intrinsic that loads a 128-bit vector of four 32-bit floats from memory. "Split-complex format (separate Re/Im arrays) enables unit-stride vld1q_f32 loads."