Axe Layout Abstraction

Updated 29 January 2026
  • Axe Layout is a hardware-aware abstraction that maps logical tensor coordinates to physical hardware axes using sharding, replication, and offset management.
  • It underpins a domain-specific language and compiler that optimize tensor operations on GPUs, multi-GPU clusters, and AI accelerators.
  • Empirical results show that this unified approach achieves performance competitive with or superior to hand-optimized vendor kernels.

An axe layout is a hardware-aware abstraction for defining how high-dimensional data structures—specifically tensors—are mapped onto the multi-axis physical spaces of modern computing platforms. This concept enables uniform and efficient mapping of tensor elements onto device meshes, memory hierarchies, and accelerator-specific compute resources. Axe layout unifies all notions of sharding, replication, tiling, and offset within a simple, extensible vocabulary, serving as the foundational layout formalism for the “Axe” distributed domain-specific language (DSL) and compiler targeting GPUs, multi-GPU clusters, and AI accelerators (Hou et al., 27 Jan 2026).

1. Formal Structure of Axe Layouts

At the core of the axe layout abstraction is the specification of a mapping from logical tensor coordinates to physical hardware axes. An axe layout $L = (D, R, O)$ consists of:

  • D (Shard): An ordered list of “iters” $I_k = (e_k, s_k, a_k)$, where $e_k$ is the extent (size) along the subdivided logical dimension, $s_k$ is the stride applied along that axis, and $a_k$ is the named hardware axis (the physical or logical resource, e.g., thread lane, warp, GPU ID, memory bank).
  • R (Replica): A list of replication iters $J_t = (r_t, \rho_t, b_t)$ that duplicate the entire $D$-mapping along specified hardware axes via offset strides.
  • O (Offset): A fixed vector offset $O \in \mathbb{Z}^A$ in the axis space, applied after sharding and replication.

This mapping induces a set of physical locations for each logical tensor element:

$$f_L(x) = \{\, f_D(x) + f_R(u) + O \mid u \text{ ranges over all replica indices} \,\}$$

where $f_D(x)$ computes the axis-wise physical placement of logical index $x$ using the strides and axis IDs in $D$, and $f_R(u)$ applies the replication offsets.

Any such mapping can be recast in affine form:

$$\text{phys}(x) = A \cdot x + b$$

where $A$ bundles all axis strides and $b$ collects all offsets (Hou et al., 27 Jan 2026).
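To make the mapping concrete, here is a minimal Python sketch of evaluating such a layout. It is illustrative only: the tuple representation, the axis names, and the function names are assumptions for this example, not the paper's implementation.

```python
from itertools import product

# Illustrative sketch (assumed representation, not the Axe implementation):
# an axe layout L = (D, R, O) as plain tuples.
#   shard iter:   (extent, stride, axis)
#   replica iter: (count, stride, axis)
#   O:            dict mapping axis name -> fixed offset

def shard_place(x, D):
    """f_D: decompose logical index x over the shard iters (innermost iter
    varies fastest) and accumulate stride * digit on each iter's axis."""
    place = {}
    for extent, stride, axis in reversed(D):
        x, digit = divmod(x, extent)
        place[axis] = place.get(axis, 0) + stride * digit
    return place

def layout_locations(x, D, R, O):
    """f_L: all physical locations of logical element x, one per replica
    index u, each shifted by the replica strides and the fixed offset O."""
    base = shard_place(x, D)
    locations = []
    for u in product(*(range(count) for count, _, _ in R)):
        loc = dict(base)
        for (count, stride, axis), digit in zip(R, u):
            loc[axis] = loc.get(axis, 0) + stride * digit
        for axis, off in O.items():
            loc[axis] = loc.get(axis, 0) + off
        locations.append(loc)
    return locations

# Example: an 8-element vector sharded over 2 GPUs (4 contiguous elements
# each), replicated across 2 memory banks, offset by 16 within each GPU.
D = [(2, 1, "gpu"), (4, 1, "addr")]   # x -> gpu = x // 4, addr = x % 4
R = [(2, 1, "bank")]                  # duplicate onto banks 0 and 1
O = {"addr": 16}                      # fixed base address per GPU

locs = layout_locations(5, D, R, O)   # element 5: addr 21? no -- addr 17 on gpu 1, once per bank
```

Because each shard iter contributes `stride * digit` and the digits are linear in the mixed-radix decomposition of $x$, the per-axis placement is exactly the affine form $A \cdot x + b$ above.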

2. Practical Composition: Sharding, Replication, and Tiling

The separation of sharding, replication, and offset operations permits a variety of layout strategies unified within the same abstraction:

  • Sharding: Partitioning the logical tensor space across hardware axes (e.g., distributing rows of a matrix over multiple GPUs).
  • Replication: Broadcasting tensor slices or elements across several hardware axes to support primitives like all-reduce or broadcast.
  • Tiling: Using the Kronecker product operator $\otimes$ to assemble larger tensors from tiled sub-layouts, ensuring that their physical footprints do not overlap. To guarantee this, the mapping explicitly tracks the span of the placements $y \in f_L(x)$ along each axis $a$:

$$\text{span}_a(f_L) = \max_{x} y[a] - \min_{x} y[a] + 1$$

This model allows for direct composition of complex collective operations (e.g., distributed matmul with sharded input and replicated output) and precise avoidance of resource collisions during code generation (Hou et al., 27 Jan 2026).
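The span calculation and the non-overlap guarantee can be sketched in a few lines of Python. This is illustrative only: the tuple representation and the names `span` and `tile` are assumptions, not the Axe API.

```python
# Illustrative sketch (assumed representation, not the Axe API): the span of
# a shard list along one axis, and a tile (Kronecker) composition that scales
# the outer layout's strides by the inner layout's span so footprints abut.

def span(D, axis):
    """max - min + 1 of the placement along `axis` over all logical indices.
    For non-negative strides this is sum((extent - 1) * stride) + 1."""
    return sum((e - 1) * s for e, s, a in D if a == axis) + 1

def tile(outer, inner):
    """Kronecker-style composition: the outer layout steps in units of the
    inner layout's span on each axis, so tiles never overlap."""
    scaled = [(e, s * span(inner, a), a) for e, s, a in outer]
    return scaled + inner   # outer iters vary slower than inner iters

inner = [(4, 1, "addr")]   # a 4-element contiguous tile
outer = [(3, 1, "addr")]   # 3 tiles along the same axis
L = tile(outer, inner)     # outer stride scaled to 4: tiles start at 0, 4, 8
assert span(L, "addr") == 12   # 3 tiles of 4 cover addresses 0..11 exactly
```

Scaling the outer strides by the inner span is precisely what makes the composite footprint a disjoint union of tile footprints.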

3. Operators and DSL Integration

The axe layout abstraction underpins an execution model and DSL that decouple logical tensor operations from their physical realization.

  • Execution Scopes: Hierarchical definitions including kernel, thread block (CTA), warpgroup/warp, and thread, permitting partitioned or nested mappings.
  • Tensor Layouts: Each tensor carries metadata for shape, data type, pointer, execution scope, and its axe layout.
  • High-Level Operators: Operators such as copy, pointwise, reduce, and matmul are annotated with their respective layouts; the compiler selects optimal schedules based on these layouts and execution scopes (e.g., using register operations at the thread level or collective reductions across device meshes).

This approach supports both explicit thread-local implementations (as in CuTe) and high-level block/collective implementations (as in Triton) within a single, unified infrastructure (Hou et al., 27 Jan 2026).

4. Transformation Primitives and Normal Forms

Compiler utilities built atop the axe layout system provide formal support for layout analysis and transformation:

  • Canonicalize: Reduces a layout to a normalized form by removing unit extents, merging adjacent iterations sharing axes, and normalizing strides. This enables precise equivalence checking between layouts.
  • Group: Identifies whether the iteration list $D$ can be split or fused to match a target shape, essential for tiling and slicing.
  • Tile ($\otimes$): Generates a composite layout from two grouped layouts, scaling one’s strides by the span of the other to ensure non-overlapping placement.
  • Slice: Given an axis-aligned range within the logical tensor, precisely computes the corresponding layout covering that sub-region, taking into account both no-wrap and one-wrap cases.

Together, these primitives enable mechanical derivation of address calculation and loop nesting for arbitrary high-level tensor operators (Hou et al., 27 Jan 2026).
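The Canonicalize rule in particular is mechanical enough to sketch directly. The tuple representation and the exact fusion condition below are assumptions for illustration, not the paper's implementation: two adjacent iters on the same axis fuse when the outer stride equals the inner stride times the inner extent, i.e., when the pair describes one contiguous walk.

```python
# Illustrative canonicalization sketch (assumed representation): drop unit
# extents, then fuse an adjacent (outer, inner) pair on the same axis when
# outer_stride == inner_stride * inner_extent (one contiguous walk).

def canonicalize(D):
    out = []
    for it in D:
        e, s, a = it
        if e == 1:
            continue                      # a unit extent places nothing
        if out:
            e0, s0, a0 = out[-1]
            if a0 == a and s0 == s * e:   # outer composes contiguously
                out[-1] = (e0 * e, s, a)  # fuse into one iter
                continue
        out.append(it)
    return out

# Removing the unit "gpu" iter makes the two "addr" iters adjacent;
# (2, 4, addr) over (4, 1, addr) is just an 8-long stride-1 walk:
assert canonicalize([(2, 4, "addr"), (1, 7, "gpu"), (4, 1, "addr")]) \
       == [(8, 1, "addr")]
```

Comparing two layouts then reduces to canonicalizing both and checking equality of the resulting iter lists, which is the equivalence check the section describes.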

5. Performance and Expressivity

Empirical evaluation demonstrates that the axe layout abstraction, when embedded in the Axe compiler and DSL, achieves performance competitive with or exceeding hand-optimized vendor kernels:

  • Single-GPU GEMM (NVIDIA B200): Axe achieves 97–100% of cuBLAS’s TFLOP/s across Qwen3/LLaMA/GPT-3 shapes.
  • Fused Mixture-of-Experts Layer: Axe outperforms FlashInfer and SGLang Triton internals, with up to 1.36× speedup.
  • Multi-GPU GEMM with Reduce-Scatter: On a 4-GPU mesh, end-to-end latency is up to 1.4× lower than cuBLAS+NCCL.
  • AI Accelerator (AWS Trainium 1): Matches or exceeds vendor reference kernels for both GEMM and attention, reducing code size significantly (e.g., MHA: 228 lines, vs. 1188 for NKI).

These results verify that the single abstraction suffices for mapping both intra-device (threads, warps, registers, on-chip memory) and inter-device (multi-GPU, mesh, all-reduce) layout patterns, while maintaining a compact, analyzable representation (Hou et al., 27 Jan 2026).

6. Significance and Implications

Axe layout introduces a uniform, algebraically tractable basis for tensor data and compute placement across an entire stack of hardware resources. This unification enables:

  • Precise, predictable code generation for a comprehensive range of layouts, including highly nontrivial sharding, mixed replication, and flexible offset management.
  • Mechanized verification of layout-equivalence and collision freedom through canonical forms.
  • Portability of high-level tensor programs across accelerator classes with minimal or no modification.
  • Composition of block-local and distributed primitives inside a single kernel invocation, reflecting the maturing requirements of large-scale deep learning and high-performance machine learning compilation (Hou et al., 27 Jan 2026).

A plausible implication is the obsolescence of ad-hoc, backend-specific layout conventions in favor of explicitly parameterized, analyzable models, leading to enhanced compiler optimizability, extensibility, and correctness guarantees.
