Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel

Published 14 Apr 2026 in cs.DC, cs.LG, and cs.PL | (2604.13327v1)

Abstract: Modern GPU workloads, especially LLM inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.

Summary

  • The paper introduces the Event Tensor abstraction that unifies fine-grained GPU synchronization for both static and dynamic scheduling in ML inference.
  • It presents a compiler framework that fuses operators into a single persistent kernel, significantly lowering kernel launch overhead and reducing engine warmup times.
  • Empirical results demonstrate up to 1.40× speedup on tensor parallel workloads and marked efficiency improvements for irregular data-dependent tasks such as MoE layers.

Motivation and Background

Contemporary GPU workloads for ML inference, particularly in LLMs, are increasingly constrained by kernel launch latencies and enforced synchronization boundaries that impede inter-kernel parallelism. While megakernel fusion eliminates launch gaps and exposes fine-grained dependencies for static workloads, real-world LLM inference is highly dynamic—variable input shapes and data-dependent control flows (e.g., Mixture-of-Experts layers) remain major barriers. Prior approaches such as CUDA Graphs or monolithic megakernels are hampered by either the persistence of synchronization boundaries or their inability to flexibly adapt to runtime dynamism. Furthermore, the complexity of handcrafting megakernel scheduling logic hinders adoption and composability.

Event Tensor Abstraction

The paper introduces the Event Tensor abstraction, providing a unified compiler IR primitive that extends scalar/standalone semaphore synchronization to first-class multi-dimensional tensor events. Each element of the Event Tensor represents a synchronization event (e.g., completion of a tiled task), allowing explicit, compact encoding of producer/consumer relationships over symbolic, shape-dynamic axes in the computational graph. This elevates the granularity of dependency tracking from coarse (operator-level) to fine (tile- or task-level), crucial for persistent kernels with significant task tiling.

Through symbolic representation, each Event Tensor encodes dependencies for a large set of tasks parameterized by symbolic dimensions such as batch or sequence length. The framework incorporates data-dependent triggering and dynamic event update: dependencies between tiles can be decided at runtime from input values, enabling support for irregular control flows such as token routing in MoE layers.
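As a rough host-side illustration of the idea above, the sketch below models an Event Tensor as a grid of integer counters that producers decrement and consumers poll. The names `EventTensor`, `notify`, `is_ready`, and `build_events` are hypothetical and chosen for this example; the count-down-to-zero convention is an assumption, and real megakernels would use device-side atomics rather than a Python dictionary.

```python
from itertools import product


class EventTensor:
    """Host-side model of a multi-dimensional grid of event counters (illustrative only)."""

    def __init__(self, shape, initial_count):
        self.shape = tuple(shape)
        # One integer counter per element; an event "fires" when it reaches zero.
        self.counters = {
            idx: initial_count
            for idx in product(*(range(n) for n in self.shape))
        }

    def notify(self, idx):
        """A producer tile signals completion by decrementing the counter at idx."""
        if self.counters[idx] <= 0:
            raise ValueError(f"event {idx} notified more times than expected")
        self.counters[idx] -= 1

    def is_ready(self, idx):
        """A consumer tile may start once every expected notification has arrived."""
        return self.counters[idx] == 0


def build_events(batch, k_tiles):
    # `batch` plays the role of a symbolic dimension: it is only known at run
    # time, and the same code path handles any value without recompilation.
    return EventTensor((batch,), initial_count=k_tiles)


events = build_events(batch=3, k_tiles=2)
events.notify((0,))  # first producer tile for row block 0
events.notify((0,))  # second producer tile for row block 0
events.notify((1,))  # row block 1 is still waiting on one more producer
ready = [i for i in range(3) if events.is_ready((i,))]
print(ready)  # [0]
```

Only the consumer for row block 0 is released, which is the tile-level granularity the abstraction is meant to capture.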

Compiler Pipeline and Scheduling Transformations

The Event Tensor abstraction enables the construction of the Event Tensor Compiler (ETC), which fuses and schedules operators into a single persistent kernel. ETC provides static and dynamic scheduling transformations as formalized rewriting passes:

  • Static scheduling: The compiler builds per-SM execution queues, statically assigns tasks, lowers Event Tensor dependencies into notify()/wait() semaphore operations, and constructs persistent kernel main loops. This path is ideal for regular workloads with predictable task durations and shapes.
  • Dynamic scheduling: For workloads with unpredictable latencies or data-dependent control flow, an on-GPU lightweight scheduler is generated. It dynamically dispatches tasks to SMs as their dependencies become ready, using atomic push/pop operations and task readiness tracking on the Event Tensor.
  • The compiler supports both strategies, exposing composable scheduling policy selection to the user or higher-level frameworks, and lowering the runtime state to only integer event counters and a shared task queue—no explicit runtime task graph materialization is needed.
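The two strategies in the list above can be contrasted with a small host-side simulation. This is a sketch under stated assumptions, not ETC's generated code: the function names, the round-robin assignment, and the toy task durations are all illustrative, and scheduling overhead (which favors the static path in practice) is ignored.

```python
from collections import deque
import heapq


def static_queues(tasks, num_sms):
    """Assign tasks to per-SM queues round-robin, fixed before launch."""
    queues = [[] for _ in range(num_sms)]
    for i, task in enumerate(tasks):
        queues[i % num_sms].append(task)
    return queues


def simulate_static(durations, consumers, queues):
    """Each SM runs its queue in order, waiting until a task's producers finish."""
    producers = {t: [] for t in durations}
    for t, outs in consumers.items():
        for c in outs:
            producers[c].append(t)
    finish, sm_time, heads = {}, [0] * len(queues), [0] * len(queues)
    remaining = sum(len(q) for q in queues)
    while remaining:
        progressed = False
        for sm, q in enumerate(queues):
            if heads[sm] == len(q):
                continue
            t = q[heads[sm]]
            if all(p in finish for p in producers[t]):  # wait() would return
                start = max([sm_time[sm]] + [finish[p] for p in producers[t]])
                finish[t] = sm_time[sm] = start + durations[t]
                heads[sm] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise RuntimeError("static schedule deadlocks")
    return max(finish.values())


def simulate_dynamic(durations, consumers, num_sms):
    """Free SMs pop ready tasks; finished tasks decrement consumers' counters."""
    pending = {t: 0 for t in durations}  # integer event counters
    for outs in consumers.values():
        for c in outs:
            pending[c] += 1
    ready = deque(t for t in durations if pending[t] == 0)  # shared task queue
    running, free_sms, now = [], num_sms, 0
    while ready or running:
        while ready and free_sms > 0:  # pop and dispatch
            t = ready.popleft()
            heapq.heappush(running, (now + durations[t], t))
            free_sms -= 1
        now, done = heapq.heappop(running)  # advance to the next completion
        free_sms += 1
        for c in consumers.get(done, []):
            pending[c] -= 1  # notify
            if pending[c] == 0:
                ready.append(c)  # push the newly ready task
    return now


durations = {"a": 4, "b": 1, "c": 1, "d": 1, "e": 1}
consumers = {"b": ["c"]}  # task c depends on task b
queues = static_queues(list(durations), num_sms=2)
static_makespan = simulate_static(durations, consumers, queues)
dynamic_makespan = simulate_dynamic(durations, consumers, num_sms=2)
print(queues)                             # [['a', 'c', 'e'], ['b', 'd']]
print(static_makespan, dynamic_makespan)  # 6 4
```

In this toy case one long task stalls the SM it was statically paired with, while the dynamic dispatcher keeps the other SM busy. With uniform durations the two makespans would coincide, and the static path would avoid the queue traffic.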

Expressiveness: Shape and Data-Dependent Dynamism

The abstraction's expressiveness comes from treating the Event Tensor as a first-class object: symbolic shapes allow one persistent kernel to be compiled for many runtime shapes, eliminating repeated recompilation and graph-capture costs. For data-dependent execution (e.g., sparse or irregular subgraphs such as in MoE), runtime index-based updates and notifications let consumer tasks trigger only once their logical inputs are complete. Experiments show that the abstraction covers both static and dynamic scheduling, with empirical benefits for workloads with shape variability and input-dependent task graphs.
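To make the data-dependent case more concrete, the sketch below derives per-expert wait counts from MoE routing decisions at run time. It assumes a CSR-style row pointer over experts (called `indptr` here) and a fixed tile height; the helper names, the tile size, and the way the counts would feed event counters are assumptions for illustration, not the paper's implementation.

```python
def expert_indptr(topk_ids, num_experts):
    """CSR-style row pointer: tokens routed to expert e occupy [indptr[e], indptr[e + 1])."""
    counts = [0] * num_experts
    for token_experts in topk_ids:
        for e in token_experts:
            counts[e] += 1
    indptr = [0]
    for c in counts:
        indptr.append(indptr[-1] + c)
    return indptr


def tiles_per_expert(indptr, tile_rows):
    """Number of row tiles each expert's GEMM needs (ceiling division)."""
    return [
        -(-(indptr[e + 1] - indptr[e]) // tile_rows)
        for e in range(len(indptr) - 1)
    ]


# Five tokens, each routed to its top-2 experts out of four (made-up routing).
topk_ids = [[0, 2], [2, 3], [2, 0], [2, 1], [2, 3]]
indptr = expert_indptr(topk_ids, num_experts=4)

# Assumed use: a consumer of expert e's output waits for this many tile
# completions, so the counter is initialized from runtime data.
wait_counts = tiles_per_expert(indptr, tile_rows=2)
print(indptr)       # [0, 2, 3, 8, 10]
print(wait_counts)  # [1, 1, 3, 1]
```

The skew toward expert 2 changes only the counter values, not the compiled code, which is the property the paragraph above describes.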

Empirical Results

The authors evaluate ETC against highly tuned baselines (vLLM, SGLang with CUDA Graph, specialized libraries) using multi-GPU clusters. Notable results include:

  • Up to 1.40× speedup on fused GEMM + Reduce-Scatter and All-Gather + GEMM tensor parallel workloads versus non-fused baselines.
  • For data-dependent workloads (MoE), performance advantages of up to 1.23× versus highly optimized kernels.
  • In dynamic, low-batch LLM inference scenarios, ETC matches or surpasses baselines, while reducing engine warmup times by up to 3.5×, due to AOT compilation leveraging symbolic Event Tensors rather than runtime JIT compilation or CUDA Graph captures (e.g., 35s initialization vs. up to 583s for vLLM/SGLang).
  • Empirical analysis on static vs. dynamic scheduling shows up to 8% gains due entirely to inter-kernel parallelism unlocked by fine-grained Event Tensor dependencies; static scheduling dominates for regular workloads, while dynamic strategies excel in irregular (e.g., MoE) settings.

Implications and Future Directions

Practically, ETC provides a path toward eliminating per-operator kernel launch overheads, minimizing global synchronization, and fully leveraging hardware parallelism in diverse, real-world LLM inference serving scenarios. The abstraction offloads complexity from the user, reduces kernel engineering labor, and is compiler-backend agnostic. By handling shape- and data-dependent dynamism systematically, ETC enables true ahead-of-time compilation, which simplifies deployment in production serving clusters.

The implications for ML compiler and systems design are substantial: Event Tensor generalizes task graph representations and synchronization, catalyzing a shift from handcrafted, operator-specific megakernels to systematic, optimizable compiler IR-driven pipelines. This points toward a future in which dynamically scheduled, fully pipelined megakernels can be automatically generated from high-level computational graphs, unlocking further hardware efficiency gains and facilitating broader fusion and parallelization strategies.

Theoretical avenues for future development include higher-level program analysis for automatic Event Tensor graph extraction, adaptive scheduling policy selection based on workload profiling, and application to other accelerators or execution paradigms beyond GPUs.

Conclusion

The Event Tensor abstraction and the ETC compiler framework constitute a unified, expressive, and highly efficient mechanism for compiling dynamic megakernels suitable for modern ML inference workloads. By abstracting fine-grained event dependencies as multi-dimensional, symbolic tensors, ETC delivers state-of-the-art latency, resource utilization, and deployment flexibility. Its generality and minimal runtime overhead make it a promising foundation for continued advances in ML compiler technology and high-performance GPU serving systems.

Explain it Like I'm 14

Overview

This paper is about making big AI models (like the ones behind chatbots) run faster on GPUs. The authors introduce a new idea called an “Event Tensor” and a compiler (a code-making tool) built around it, called the Event Tensor Compiler (ETC). Their goal is to cut delays between tiny GPU jobs, overlap more work at the same time, and still handle real-world “messiness” like changing input sizes and choices made at runtime (for example, in Mixture-of-Experts models).

What problems are they trying to solve?

In simple terms, they asked:

  • How can we reduce the wasted time caused by starting lots of tiny GPU jobs one after another?
  • How can we let different parts of the model run at the same time, even if they depend on each other in complicated ways?
  • How can we do all this when inputs change size (like different batch sizes) and when the model’s path depends on the data (like which “expert” in an MoE gets a token)?
  • Can we make a tool that developers can actually use without writing tons of tricky, error-prone GPU code?

How does their approach work?

Think of a GPU like a factory with many workers (SMs). A normal setup makes the workers wait for a manager (the CPU) to hand out one small job at a time. That creates gaps and wasted time.

This paper fuses many small jobs into one big, always-running “megakernel,” so workers don’t keep stopping and starting. But doing that is hard when job sizes change or when the next step depends on results we only know later. Here’s where the new ideas come in:

The Event Tensor (the “scoreboard”)

Imagine a big grid-like checklist where each square is an “event” that flips to “done” when a small piece of work finishes. That grid is the Event Tensor. It tracks exactly which little pieces (tiles) are done and which ones can start next. Because it’s a tensor (a multi-dimensional array), it scales to millions of tiny events and matches how AI data is laid out.

  • Shape dynamism: The grid can have sizes that aren’t fixed ahead of time (like batch size B). At runtime, it just fills in the exact size without recompiling.
  • Data-dependent dynamism: Some models (like MoE) decide at runtime where tokens go. The Event Tensor can update its “who waits for whom” rules on the fly based on those choices, then trigger the right next tasks.

Two ways to give out work (scheduling)

The compiler can generate megakernels that use one of two task assignment styles:

  • Static scheduling: Before running, each worker gets a pre-made to-do list. This is like assigning seats on a bus ahead of time—very low overhead and great when tasks are predictable.
  • Dynamic scheduling: A tiny on-GPU “dispatcher” hands out the next ready task to whichever worker is free. This is like a smart queue—perfect when tasks vary or depend on data.

Because the Event Tensor is the unified “scoreboard,” either approach can coordinate fine-grained dependencies cleanly.

What did they find?

The authors tested ETC on common and tough cases for LLMs:

  • Overlapping compute and communication across GPUs (for example, GEMM + Reduce-Scatter and All-Gather + GEMM): Up to about 1.40× faster than strong baselines. The choice of static or dynamic scheduling depends on whether timing is predictable.
  • Mixture-of-Experts (MoE) layers (which are very dynamic): Up to about 1.23× faster than specialized libraries by fusing everything into one megakernel and using dynamic scheduling for better load balancing.
  • End-to-end low-batch serving (the “fast response” scenario): Lower time per generated token than top serving systems (like vLLM and SGLang) in many settings, especially at batch size 1, because fine-grained overlaps remove idle gaps between operators.
  • Warmup time: ETC compiles ahead-of-time (AOT). That means no repeated just-in-time (JIT) compilation or repeated CUDA Graph captures when shapes change, cutting warmup overhead by up to 3.5× in their tests.

Why this matters: Even “only” 10–40% faster can save big money at datacenter scale and improves user experience for real-time apps.

Why is this important?

  • Faster, smoother LLMs: The Event Tensor lets different parts of the model run in parallel without tripping over each other, which is crucial for quick responses in chat, coding help, and agent-like tools.
  • Handles real-world messiness: Different batch sizes and data-dependent choices (like MoE routing) don’t force costly re-compilation or graph recaptures. The same compiled megakernel adapts at runtime.
  • Less engineering pain: Instead of hand-crafting complicated GPU code with delicate synchronization, developers rely on a clear abstraction (Event Tensor) and a compiler pipeline that generates high-performance megakernels.
  • Broadly useful: Because the abstraction sits at the compiler level, it can be integrated into other toolchains and benefit the wider ML systems community.

Bottom line

The paper introduces Event Tensor—a “scoreboard” for tiny GPU tasks—and a compiler (ETC) that uses it to build fast, flexible megakernels. By unifying how dependencies are tracked, ETC makes it possible to overlap more work, handle changing shapes and data-dependent paths, and reduce warmup costs, all while keeping programming manageable. This leads to state-of-the-art latency for serving LLMs, especially in real-time, low-batch scenarios.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of unresolved issues that future work could concretely address.

  • Scalability of the on-GPU dynamic scheduler: The current centralized global-memory queue is acknowledged to risk contention as SM counts grow, but contention levels, throughput limits, and tail-latency effects are not quantified; alternative designs (e.g., per-SM queues, hierarchical or work-stealing schedulers) are not explored or evaluated.
  • Static scheduling policy quality: Per-SM task queues are built with a simple round-robin policy and ad hoc assumptions. There is no cost model or autotuner to choose task orders that optimize overlap, cache locality, or communication/computation balance across shapes and workloads.
  • Automatic selection of scheduling strategy: The paper manually chooses static or dynamic scheduling per workload; there is no analytical model or runtime policy to auto-select or hybridize strategies at operator granularity based on predicted variability, contention, and overheads.
  • Handling of shape dynamism in static scheduling: The “sample-and-reuse next-larger schedule” approach can waste work or degrade locality for unseen shapes; there is no method to pick representative shape sets, bounds on suboptimality, or fallback for shapes exceeding sampled envelopes.
  • Loss of parallelism under conservative static handling of data-dependent dynamism: Rewriting notify()/wait() to a single event in the worst case may collapse fine-grained parallelism; the impact on latency and ways to retain partial parallelism under uncertainty are not studied.
  • Memory overhead and layout of Event Tensors: Event counters can number in the millions for deep, tiled graphs; the memory footprint, cache effects, and HBM traffic due to frequent atomic updates/spin-waits are not characterized, nor are storage/placement strategies (e.g., tiling, sharding, compression) evaluated.
  • Spin-waiting costs and SM efficiency: wait() uses busy-wait loops that may waste SM cycles and energy or interfere with latency-sensitive tasks; hardware-supported waits, backoff strategies, or cooperative groups-based primitives are not investigated.
  • Correctness and robustness of event protocols: There is no formal model or proof of deadlock/livelock freedom with notify/wait and dynamic triggers, nor mechanisms for watchdogs, timeouts, or safe recovery if events are miscounted or never reach zero.
  • Determinism and reproducibility: Dynamic push-pop scheduling introduces nondeterministic execution order; its impact on numerical reproducibility, debugging, and regression testing is not discussed, and deterministic modes are not provided.
  • Fusion profitability and resource pressure: The compiler lacks an analysis to decide when to fuse or de-fuse to avoid register/shared-memory pressure, occupancy loss, or instruction cache pressure; no measurements of code size growth or performance cliffs relative to vendor libraries (e.g., cuBLAS) are reported.
  • Autotuning of tile shapes and micro-kernels: Tile sizes and low-level mapping are not automatically tuned in the megakernel context, and the paper notes cases where compiler-generated GEMMs trail cuBLAS; systematic autotuning integrated with Event Tensor scheduling is absent.
  • Debuggability and profiling: With compiled-in scheduling and no materialized task graph, there is no tooling described to trace dependencies, visualize event lifecycles, attribute stalls to events or queues, or diagnose deadlocks/perf regressions.
  • Multi-GPU/device-wide scheduling semantics: While Event Tensors can be sharded and some collectives are fused, there is no general mechanism for cross-device event propagation, backpressure across NIC/links, or dynamic scheduling that spans devices and handles interconnect variability.
  • Communication patterns and topologies: Evaluation focuses on ring all-gather and multimem-based reduce-scatter; generalization to other collectives (tree, butterfly, all-to-all for expert parallel), topologies (e.g., fat-tree, IB), and interaction with NCCL’s internal scheduling is not addressed.
  • Portability beyond NVIDIA Blackwell/CUDA 13: The approach relies on CUDA atomics, multimem PTX, and DMA engines; feasibility, performance, and alternative implementations on other vendors (e.g., AMD ROCm) or older GPUs are not evaluated.
  • Multi-tenancy, fairness, and preemption: Persistent megakernels can monopolize SMs; interactions with MPS/MIG, fairness across co-located jobs, preemption/priority mechanisms, and their impact on latency are not studied.
  • Tail latency and variability: Results emphasize average speedups; p95/p99 latency, sensitivity to network/compute jitter, and tail amplification from scheduler contention or spin-waiting are not reported.
  • Compile-time and deployment costs: AOT avoids runtime recapture, but the paper does not quantify compile time, binary size, memory use, or the operational cost of managing many shapes/models/variants; caching and incremental compilation strategies are not discussed.
  • Safety of data-dependent indexing and count initialization: The einsum-like mapping and runtime-derived counters (e.g., from topk/exp_indptr) lack static verification; tooling to check that wait counts match producer counts and catch index out-of-bounds or mismatched triggers is missing.
  • Weight prefetching reliance on user annotations: The prefetch pass depends on manual hints; automatic dependence analysis, correctness checks (e.g., aliasing, lifetime), and sensitivity to mis-annotations are not provided.
  • Generality across workloads: Although the abstraction is claimed to be general, evaluation is limited to LLM inference on B200; applicability to training, sparse/graph workloads, diffusion models, and non-ML GPU applications remains untested.
  • Integration with serving engines and system overheads: End-to-end results note higher CPU-side overhead in the ETC-based engine; concrete integration pathways to existing schedulers (vLLM/SGLang), and methods to reduce distributed runtime overheads, are not detailed.
  • Limits of the feed-forward assumption: Data-dependent support assumes strictly feed-forward graphs; support for dynamic control-flow with cycles, early exits, retries, or loop-carried dependencies is not specified.
  • Failure modes for unseen/degenerate runtime conditions: Behavior under extreme shape outliers, highly skewed MoE routing, or partial device failures (e.g., link degradation) is not characterized; mechanisms for graceful degradation or adaptive reconfiguration are absent.
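The "sample-and-reuse next-larger schedule" approach mentioned in the list above can be pictured as a sorted lookup. The sketch below is an assumption about how such a lookup might work, with a hypothetical `pick_schedule` helper and made-up sampled sizes; it also shows the open question about shapes that fall outside the sampled envelope.

```python
import bisect


def pick_schedule(sampled_sizes, runtime_size):
    """Return the smallest sampled size >= runtime_size, or None if outside the envelope.

    `sampled_sizes` must be sorted in ascending order.
    """
    i = bisect.bisect_left(sampled_sizes, runtime_size)
    return sampled_sizes[i] if i < len(sampled_sizes) else None


sampled = [1, 2, 4, 8, 16]          # sizes for which static queues were prebuilt
print(pick_schedule(sampled, 5))    # 8    (reuses the next-larger schedule)
print(pick_schedule(sampled, 16))   # 16   (exact match)
print(pick_schedule(sampled, 17))   # None (no fallback defined)
```

Running size 5 on the size-8 schedule is where work can be wasted, and the `None` case corresponds to the missing fallback noted above.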

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now, based on the Event Tensor abstraction and the Event Tensor Compiler (ETC).

  • LLM serving latency and cost reduction in datacenters (software, cloud, finance)
    • Use ETC to compile fused, persistent megakernels for dynamic low-batch decoding, reducing kernel launches, exposing inter-operator parallelism, and eliminating repeated CUDA Graph recaptures. Reported gains: up to 1.48× TPOT speedup at batch size 1 and up to 3.5× lower warmup times.
    • Tools/products/workflows: ETC-integrated build step producing AOT deployment artifacts; a “shape template registry” for symbolic-dimension graphs; per-model cost calculators reflecting improved TPOT.
    • Assumptions/dependencies: NVIDIA GPUs with fast atomics and multimem/copy-engine features; model operators tileable; fused kernels must be correctly tuned for target hardware; integration with engines like vLLM/SGLang or bespoke runtimes.
  • Real-time agentic workflows and coding assistants (software, education)
    • Deploy ETC-compiled pipelines to cut interactive latency in step-wise inference (e.g., attention pipelines that overlap Q/K/V ops, MoE routing+GroupGEMM). Benefits are largest in continuous low-batch decoding typical of agent tools and IDE copilots.
    • Tools/products/workflows: CI/CD step to generate AOT megakernels; runtime configuration selecting static or dynamic scheduling per operator; weight prefetching enabled by compiler annotations.
    • Assumptions/dependencies: Minimal runtime overhead from dynamic scheduling; correctness of fine-grained event dependencies; engine supports dynamic shapes without JIT recapture.
  • Turnkey MoE layer acceleration (software/AI platforms)
    • Replace multi-launch MoE sequences with a single fused megakernel using data-dependent Event Tensors (topk/exp_indptr) for routing and triggering GroupGEMM tiles; measured up to 1.23× speedup at 1024 tokens.
    • Tools/products/workflows: “MoE megakernel” library packaged as an ETC pass; diagnostics for token imbalance and expert load.
    • Assumptions/dependencies: Accurate, timely computation of routing tensors; central task queue contention controlled at scale; sufficient GPU memory for persistent kernel state.
  • Compute-communication overlap in tensor-parallel inference (software, energy)
    • Adopt fused GEMM+Reduce-Scatter (dynamic scheduler) and All-Gather+GEMM (static scheduler) kernels to keep SMs and network interconnect busy, achieving up to about 1.40× speedups over non-overlapped baselines.
    • Tools/products/workflows: A set of pre-verified fused collectives packaged for common TP sizes; ring-algorithm aware static scheduling presets; multimem-based RS kernels.
    • Assumptions/dependencies: NVLink/PCIe bandwidth stability; access to multimem PTX and DMA copy engines; kernel sizing compatible with communication chunking.
  • AOT deployment for shape-dynamic models (software, operations)
    • Eliminate runtime JIT and graph recapture complexity by compiling symbolic-shape Event Tensor graphs once, materializing concrete graphs at runtime without recompilation.
    • Tools/products/workflows: Warmup-free deployment pipelines; shape coverage policies (e.g., template sampling for extreme shapes); observability hooks to confirm no recapture occurs.
    • Assumptions/dependencies: Symbolic shape set sufficiently covers production inputs; conservative handling for unseen shapes (e.g., next-larger execution queues) is acceptable for SLOs.
  • Capacity planning and energy efficiency improvements (finance, energy, policy)
    • Use measured TPOT and kernel time reductions to reduce GPU-hours per request; incorporate into cost models and sustainability dashboards.
    • Tools/products/workflows: TPOT-based autoscaling policies; energy-per-token KPI tracking; procurement guidance favoring AOT, fused deployments.
    • Assumptions/dependencies: Performance gains persist under mixed traffic and multi-tenant loads; observability can attribute savings to reduced launch gaps and improved overlap.
  • Academic experimentation and teaching in compilers/parallel systems (academia)
    • Adopt the Event Tensor IR to teach fine-grained dependency management, dynamic scheduling trade-offs, and megakernel compilation; run lab assignments on tile-level graphs without heavy runtimes.
    • Tools/products/workflows: Open-source ETC passes on TVM; visualization of Event Tensor counters and task queues; reproducible benchmarks (e.g., GEMM+RS, MoE).
    • Assumptions/dependencies: Access to GPUs that support atomics; lab infrastructure for measuring contention and scheduler overhead.
  • Robotics perception and planning on NVIDIA edge platforms (robotics)
    • Apply ETC to low-latency perception models (e.g., object detection with dynamic batch variations) and planning modules with data-dependent branches, reducing host launches and improving pipeline overlap.
    • Tools/products/workflows: Jetson-compatible ETC builds; fused attention/MLP megakernels; scheduler policies tuned for real-time constraints.
    • Assumptions/dependencies: CUDA features available on edge GPUs; deterministic latency requirements satisfied (consider static scheduler for critical paths).
  • Healthcare NLP assistants and clinical documentation (healthcare)
    • Improve responsiveness of clinical dictation, coding support, or triage assistants by deploying ETC-compiled megakernels for variable-length, low-batch interactions; eliminate warmup delays during shift changes or workload spikes.
    • Tools/products/workflows: AOT artifacts signed for regulated deployment; operational safeguards for dynamic layers (MoE).
    • Assumptions/dependencies: Compliance considerations (PHI security, auditability) met; hardware availability; predictable behavior across diverse input shapes.
  • Event Tensor observability and debugging (software tooling)
    • Provide lightweight runtime introspection of event counters, notifications, and scheduler queues to debug stalls and load imbalance.
    • Tools/products/workflows: “Event Tensor Inspector” integrated with profiling tools; anomaly detection for spin-wait hotspots.
    • Assumptions/dependencies: Minimal overhead from instrumentation; access to compiler-level annotations in production builds.

Long-Term Applications

Below are opportunities that require further research, scaling, cross-vendor support, or development.

  • Cross-vendor GPU/NPU backends and portability (software, hardware)
    • Generalize Event Tensor compilation to AMD/Intel GPUs and NPUs by mapping notify/wait to vendor atomics and providing equivalents for multimem/copy engines.
    • Tools/products/workflows: Backend portability layer; IR-to-target mappers; validation suites across architectures.
    • Assumptions/dependencies: Feature parity for atomics and device-side scheduling; vendor toolchain maturity; performance tuning for non-NVIDIA backends.
  • Distributed Event Tensors across nodes (software, HPC)
    • Extend Event Tensor to inter-node scheduling (cluster-wide counters, RDMA-based notifications) for dynamic, multi-GPU, multi-host pipelines.
    • Tools/products/workflows: Cluster-level “event fabric” with reliable triggers; distributed schedulers aware of network latency/jitter.
    • Assumptions/dependencies: Robust, low-latency networking; fault-tolerant event propagation; security and isolation in multi-tenant clusters.
  • OS/driver-level GPU scheduling integration (software systems)
    • Co-design ETC with driver/runtime to reduce global-memory queue contention and provide hardware-assisted queues, improving dynamic scheduler scalability.
    • Tools/products/workflows: Driver APIs for event-queue primitives; per-SM or per-cluster queues; QoS-aware scheduling for mixed workloads.
    • Assumptions/dependencies: Vendor support for new APIs; careful isolation to avoid interference across processes.
  • Hardware-software co-design for event-driven execution (hardware, standards)
    • Introduce ISA primitives for event tensors (counter arrays, hardware notify/wait, event-triggered dispatch) to minimize spin-wait and atomic overhead.
    • Tools/products/workflows: Architectural proposals; simulators; evaluation on future GPU designs.
    • Assumptions/dependencies: Long hardware lead times; standards coordination; compatibility with existing programming models.
  • Formal verification of dynamic megakernel correctness (academia, safety-critical sectors)
    • Develop verification frameworks proving absence of deadlocks/livelocks and correctness of data-dependent triggers for safety-critical applications (automotive, healthcare).
    • Tools/products/workflows: Model checkers for Event Tensor graphs; proof-carrying code; certified compilation passes.
    • Assumptions/dependencies: Scalable formal methods for large task graphs; integration with compiler IR.
  • Real-time mixed-criticality scheduling in robotics/industrial control (robotics, manufacturing)
    • Combine static scheduling (determinism for critical tasks) with dynamic scheduling (best-effort tasks) within one megakernel, governed by event priorities and deadlines.
    • Tools/products/workflows: RTOS integration; priority-aware event tensors; latency monitors.
    • Assumptions/dependencies: Predictable execution bounds; certification and safety standards; harmonization with control loops.
  • Edge/mobile on-device AI with battery-aware scheduling (consumer devices, energy)
    • Adapt ETC to mobile NPUs/GPUs for low-latency assistants, using energy-aware tile sizing and scheduling to balance responsiveness and battery life.
    • Tools/products/workflows: Energy models in the compiler; adaptive scheduling policies; AOT packaging for app stores.
    • Assumptions/dependencies: Mobile hardware APIs for atomics/eventing; thermal constraints; privacy-preserving on-device inference.
  • Auto-tuning and shape-aware scheduling optimization (software, research)
    • Build automatic schedulers that learn optimal static/dynamic mixes per operator, tile sizes, and fusion boundaries across shape distributions.
    • Tools/products/workflows: Auto-schedulers; telemetry-driven feedback loops; policy engines for continuous optimization.
    • Assumptions/dependencies: Sufficient observability; non-intrusive online tuning; stability under changing traffic.
  • Standardization and policy around dynamic AOT compilation (policy, industry consortia)
    • Define best practices and metrics (e.g., TPOT, warmup energy) and recommend AOT for dynamic workloads to reduce energy use and operational complexity in public cloud deployments.
    • Tools/products/workflows: Industry guidelines; compliance checklists; procurement requirements favoring AOT and fused execution.
    • Assumptions/dependencies: Broad ecosystem buy-in; transparent measurement and reporting.
  • Event Tensor–aware developer tooling and IDEs (software tooling, education)
    • Provide IDE integrations to visualize dependency mappings, symbolic shapes, and fused kernels; assist with correctness and performance hints.
    • Tools/products/workflows: Visual graph browsers; tile-level simulators; linting for unsafe event updates.
    • Assumptions/dependencies: Compiler APIs exposed to tools; developer training to interpret event-driven graphs.

Glossary

  • Ahead-of-time (AOT) compilation: Compiling programs before execution time so binaries are ready at runtime, avoiding JIT overhead. "ETC achieves true ahead-of-time (AOT) compilation for dynamic workloads"
  • All-Gather: A collective communication that gathers data from all participants and distributes the complete result to all. "All-Gather + GEMM"
  • Atomic operations: Indivisible, hardware-supported operations (e.g., atomic increment/decrement) used for synchronization. "implemented with efficient hardware atomics"
  • Blackwell architecture: An NVIDIA GPU architecture generation, the successor to Hopper. "Blackwell architecture"
  • Compressed Sparse Row (CSR) format: A memory-efficient sparse matrix representation using row pointers and column/value arrays. "compressed sparse row (CSR) format."
  • Cooperative Thread Array (CTA): NVIDIA's term for a thread block, a group of threads that runs on one SM and can synchronize and share memory; used here as the tiling and synchronization granularity. "operators are already partitioned into CTA-level tiles"
  • Counter-based semaphores: Synchronization primitives that use counters to track outstanding dependencies. "counter-based semaphores"
  • CUDA Graphs: An NVIDIA mechanism to capture and replay a fixed kernel execution graph to reduce launch overheads. "CUDA Graphs"
  • Data-dependent dynamism: Runtime variation in control flow and dependencies driven by data values. "data-dependent dynamism"
  • Device function: A GPU function defining a grid of parallel tasks that execute on the device. "A device function defines a grid of tasks"
  • Direct Memory Access (DMA): Hardware-assisted memory transfer without involving SM compute cores. "copy engine (DMA)"
  • Dynamic scheduling: Runtime task dispatch that balances load by scheduling ready tasks as dependencies resolve. "dynamic scheduling improves load balance across SMs."
  • Einsum notation: A concise index-based notation for tensor contractions and reductions. "Einsum notation"
  • Event Tensor: A multi-dimensional array of synchronization events capturing fine-grained task dependencies. "An Event Tensor is a multi-dimensional structure"
  • Expert routing: The process in MoE models of assigning tokens to expert networks based on gating decisions. "expert routing"
  • GEMM + Reduce-Scatter: A fused compute-communication pattern combining matrix multiplication with a Reduce-Scatter collective. "GEMM + Reduce-Scatter"
  • GroupGEMM: Batched GEMM where inputs are grouped (e.g., by expert) to process variable-sized token groups efficiently. "GroupGEMM"
  • indptr: The index pointer array in CSR-like structures storing prefix sums of element counts per row/group. "indptr is a term commonly used in sparse matrix representations such as the compressed sparse row (CSR) format."
  • Just-in-time (JIT) compilation: Compiling programs at runtime, often to specialize for dynamic shapes or values. "just-in-time (JIT) compilation"
  • KV-Cache: Cached keys and values used in transformer attention to accelerate decoding. "KV-Cache"
  • Megakernel: A single, large, persistent GPU kernel that fuses many operators to reduce launch overhead and expose parallelism. "megakernel techniques"
  • Mixture-of-Experts (MoE): A model architecture that routes inputs to a subset of expert networks per token. "Mixture-of-Experts (MoE)"
  • NVLink: NVIDIA’s high-bandwidth interconnect for multi-GPU systems. "NVLink"
  • On-GPU scheduler: A scheduler running on the GPU that manages a global task queue and dispatches ready tasks. "an on-GPU scheduler"
  • Persistent kernel: A long-running GPU kernel that continuously executes queued tasks without relaunching. "a single persistent kernel"
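  • Programmatic Dependent Launch (PDL): A CUDA feature that lets a dependent kernel begin launching before the preceding kernel has fully completed, so the tail of one kernel can overlap with the start of the next. "Programmatic Dependent Launch (PDL)"
  • PTX instructions: Instructions in PTX (Parallel Thread Execution), NVIDIA's low-level virtual instruction set that is compiled to GPU machine code. "CUDA multimem PTX instructions"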
  • Reduce-Scatter collective: A collective operation that reduces input data across devices and scatters the reduced partitions. "Reduce-Scatter collective"
  • Ring algorithm: A communication pattern for collectives that passes data around devices in a ring. "ring algorithm"
  • RoPE: Rotary positional embeddings used in transformer attention to encode position information. "RoPE"
  • Semaphore-based synchronization: Coordination using semaphores to enforce ordering among tasks. "semaphore-based synchronization is a well-known primitive"
  • Spin-wait: A busy-wait loop where a thread repeatedly checks a condition until it is satisfied. "spin-wait state"
  • Streaming Multiprocessor (SM): The primary GPU execution unit responsible for running warps/threads. "streaming multiprocessors (SMs)"
  • Symbolic dimensions: Dimensions expressed as symbols (e.g., batch size B) resolved at runtime without recompilation. "symbolic dimensions"
  • Tensor Cores: Specialized hardware units in NVIDIA GPUs for high-throughput matrix operations. "tensor core calls"
  • Tensor parallelism: Parallelization strategy that partitions tensors across devices to distribute computation. "tensor-parallel execution"
  • TopK: Selecting the k largest (or smallest) elements, often used for routing/gating in MoE. "Computing TopK depends only on the preceding Attention output"
  • Triton Distributed: A compiler-based system for generating distributed kernels with overlapping compute/communication. "Triton Distributed v0.0.2-rc"
  • TVM: An open-source deep learning compiler stack for optimizing and generating code for diverse hardware. "Apache TVM"
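
Two of the terms above, indptr and counter-based semaphores, are easier to picture with a small example. The sketch below is a host-side Python toy, not the paper's implementation: the names build_indptr and EventCounters and the routing data are hypothetical, and the plain integer update stands in for what would be a hardware atomic on a GPU. It builds a CSR-style indptr from per-token expert assignments, then uses one counter per expert so that a consumer task becomes ready once all of its tokens have arrived.

```python
from itertools import accumulate


def build_indptr(expert_ids, num_experts):
    """CSR-style prefix sums: tokens of expert e occupy [indptr[e], indptr[e + 1])."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return [0] + list(accumulate(counts))


class EventCounters:
    """Toy counter-based events: one counter per task, ready when it reaches zero."""

    def __init__(self, pending):
        self.pending = list(pending)

    def notify(self, idx):
        # A GPU would use an atomic decrement here; this is a plain update.
        self.pending[idx] -= 1
        return self.pending[idx] == 0

    def ready(self, idx):
        return self.pending[idx] == 0


# Hypothetical routing result: token i is sent to expert expert_ids[i].
expert_ids = [2, 0, 2, 1, 2]
indptr = build_indptr(expert_ids, num_experts=4)  # [0, 1, 2, 5, 5]

# The number of outstanding dependencies per expert is data-dependent:
# it equals the size of that expert's token group.
group_sizes = [indptr[e + 1] - indptr[e] for e in range(4)]
events = EventCounters(group_sizes)
```

Expert 3 receives no tokens in this example, so its counter starts at zero and its task is ready immediately; expert 2 becomes ready only after three calls to events.notify(2).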

