AVO: Agentic Variation Operators for Autonomous Evolutionary Search

Published 25 Mar 2026 in cs.LG | (2603.24517v1)

Abstract: Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining an LLM to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today's most advanced GPU hardware.

Summary

  • The paper introduces AVO, a framework that employs self-directed coding agents to iteratively optimize kernel implementations and deliver significant performance gains.
  • It replaces static variation mechanisms with interactive agents that leverage historical context, domain-specific knowledge, and real-time feedback.
  • On NVIDIA B200 GPUs, the evolved multi-head attention kernels outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5%; the largest gains are on causal attention, with more modest gains on non-causal attention.

Agentic Variation Operators: Autonomous Evolutionary Search for Performance-Critical Kernel Optimization

Overview and Motivation

The paper "AVO: Agentic Variation Operators for Autonomous Evolutionary Search" (2603.24517) introduces Agentic Variation Operators (AVO), which replace the static variation mechanisms of evolutionary search (mutation, crossover, fixed heuristics) with self-directed coding agents. Prior LLM-augmented evolutionary search uses the LLM as a candidate generator embedded in a fixed pipeline, with limited agency and no interactive diagnosis or iterative refinement. In contrast, AVO makes the agent the variation step itself: it autonomously consults historical artifacts, domain-specific knowledge, and real-time evaluation feedback to iteratively propose, repair, critique, and verify kernel implementations.

Figure 1: Comparison of classical LLM-augmented evolutionary search frameworks and the proposed AVO, highlighting the transition from a fixed pipeline to autonomous, persistent agentic optimization.

This agentic approach enables discovery of optimizations that are unattainable in one-shot LLM frameworks, especially in highly optimized domains where further progress requires sustained, hardware-aware engineering. The central empirical target is the attention kernel, a performance-critical primitive in transformer architectures, on NVIDIA Blackwell (B200) GPUs. Both cuDNN and FlashAttention-4 (FA4) represent months of expert-level manual optimization; surpassing these kernels demands prolonged autonomous, iterative reasoning and hardware adaptation.

Formalization of Agentic Variation Operators (AVO)

AVO redefines the variation operator $\mathrm{Vary}(\mathcal{P}_t)$ as an autonomous agentic process:

$$\mathrm{Vary}(\mathcal{P}_t) = \mathrm{Agent}(\mathcal{P}_t, \mathcal{K}, \mathbf{f})$$

where $\mathcal{P}_t$ is the lineage of prior solutions, $\mathcal{K}$ is the domain-specific knowledge base, and $\mathbf{f}$ encapsulates correctness and throughput evaluation. The agent operates as a persistent, interactive coding loop: consulting previous solutions, analyzing profiler outputs, leveraging technical documentation, implementing kernel-level optimizations, and verifying improvements across benchmarks with rigorous semantic and performance testing.

Figure 2: Schematic illustration of the agentic loop in AVO, encompassing planning, tool use, persistent memory, and direct interaction with feedback and resources.

A single AVO variation step subsumes sampling, generation, and evaluation. The agent autonomously decides which prior artifacts and knowledge to leverage, adapts optimization strategies as the search trajectory evolves, and commits new versions only upon passing correctness and exhibiting performance gains. This continuous, multi-day agentic process is safeguarded against stagnation through built-in self-supervision and conditional search redirection.
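The variation step described above can be sketched as a small propose, evaluate, and commit loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `Candidate`, `vary`, the `propose` and `evaluate` callables, the attempt budget, and the stall rule are all hypothetical names and choices introduced here. In a real system `propose` would be an agent turn with tool access and `evaluate` would compile, test, and benchmark a kernel.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    source: str        # kernel source text (hypothetical representation)
    throughput: float  # score from the evaluator, e.g. TFLOPS


# Hypothetical signatures: evaluate returns (is_correct, throughput);
# propose reads the lineage, the knowledge base, and the latest feedback.
Evaluator = Callable[[str], Tuple[bool, float]]
Proposer = Callable[[List[Candidate], List[str], str], str]


def vary(
    lineage: List[Candidate],
    knowledge: List[str],
    evaluate: Evaluator,
    propose: Proposer,
    max_attempts: int = 8,
    stall_limit: int = 3,
) -> Optional[Candidate]:
    """One variation step: loop until a correct, faster candidate is found."""
    best = lineage[-1]
    feedback = ""
    stalled = 0
    for _ in range(max_attempts):
        if stalled >= stall_limit:
            # Assumed stand-in for the supervisor that redirects a stuck search.
            feedback = "redirect: try a different optimization direction"
            stalled = 0
        source = propose(lineage, knowledge, feedback)
        is_correct, throughput = evaluate(source)
        if is_correct and throughput > best.throughput:
            return Candidate(source, throughput)  # commit to the lineage
        stalled += 1
        feedback = (
            f"correct={is_correct} throughput={throughput:.1f} "
            f"best={best.throughput:.1f}"
        )
    return None  # budget exhausted without an improvement
```

The commit rule here requires a strict improvement; whether ties are also committed is a design choice this sketch does not settle.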

Experimental Results: Multi-Head and Grouped-Query Attention

The empirical evaluations focus on multi-head attention (MHA) and grouped-query attention (GQA) forward kernels, benchmarked on NVIDIA B200 hardware using BF16 precision, head dimension 128, and sequence lengths up to 32k tokens. Baselines include cuDNN (NVIDIA's production kernel) and FlashAttention-4 (open-source, state-of-the-art Blackwell-optimized kernel).

Figure 3: Throughput (TFLOPS) for multi-head attention forward-pass prefilling on B200, showing AVO outperforms baseline kernels across all key configurations.

AVO surpasses cuDNN by up to 3.5% and FA4 by up to 10.5% in causal MHA, consistently delivering higher throughput across all relevant configurations. In non-causal MHA, gains are modest (+1.8% to +2.4% over cuDNN at large sequence lengths) and fall within measurement variance at shorter lengths. The agent’s optimizations discovered during MHA evolution transfer robustly to GQA kernels: only 30 minutes of autonomous agentic effort adapts AVO’s MHA kernel for GQA, achieving up to +7.0% over cuDNN and +9.3% over FA4.

Figure 4: GQA throughput on B200 for two representative configurations, demonstrating rapid autonomous adaptation and performance gains over both baselines.
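
The percentages quoted above read most naturally as relative throughput gains. Assuming the usual definition (this summary does not spell out the formula), with $T_{\text{AVO}}$ and $T_{\text{base}}$ denoting throughput in TFLOPS measured on the same configuration:

```latex
\[
\text{gain} = \left( \frac{T_{\text{AVO}}}{T_{\text{base}}} - 1 \right) \times 100\%
\]
```

For illustration only, with made-up numbers: a baseline at 1000 TFLOPS and an evolved kernel at 1035 TFLOPS would give a gain of 3.5%. Because each "up to" figure is a maximum over configurations, the cuDNN and FA4 maxima need not occur at the same configuration.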

Evolution Trajectory and Optimization Patterns

The agentic search trajectory demonstrates intensive exploration over seven days, internally pursuing over 500 optimization directions and committing 40 kernel versions. Throughput improvements are realized as discrete jumps corresponding to architectural innovations:

  • QK-PV interleaving with bitmask causal masking,
  • Single-pass, online softmax computation,
  • Branchless accumulator rescaling with non-blocking memory fences,
  • Correction/MMA pipeline overlap,
  • Register rebalancing across warp groups.

Figure 5: Causal MHA evolution trajectory, tracking geometric mean throughput advances over 40 committed versions.

Figure 6: Non-causal MHA trajectory shows compounding gains through agent-discovered schedule and resource refinements.

Diminishing returns are observed as kernel versions progress; initial architectural changes close the majority of the performance gap, while late-stage micro-architectural tuning extracts residual headroom.
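
Single-pass online softmax, one of the items listed above, is a known technique in attention kernels: the key/value sequence is processed in blocks while a running maximum, a running denominator, and an unnormalized accumulator are maintained. The following is a minimal pure-Python sketch for a single query row, assuming finite scores. It is not the evolved kernel; the function names and block size are illustrative.

```python
import math
from typing import List, Sequence


def online_softmax_weighted_sum(
    scores: Sequence[float],
    values: Sequence[Sequence[float]],
    block: int = 2,
) -> List[float]:
    """Compute softmax(scores) @ values in one pass over blocks."""
    dim = len(values[0])
    m = -math.inf      # running max of the scores seen so far
    denom = 0.0        # running sum of exp(score - m)
    acc = [0.0] * dim  # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        # Rescale earlier partial sums to the new max (0.0 on the first block).
        scale = math.exp(m - m_new)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - m_new)
            denom += p
            for j in range(dim):
                acc[j] += p * v[j]
        m = m_new
    return [a / denom for a in acc]


def two_pass_reference(
    scores: Sequence[float], values: Sequence[Sequence[float]]
) -> List[float]:
    """Conventional softmax-weighted sum, used only to check the result."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
```

The two functions agree up to floating-point rounding, which is what lets a kernel avoid a separate pass to find the row maximum.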

Analysis of Agent-Discovered Micro-Architectural Optimizations

The ablation analysis attributes measurable gains to three hardware-level optimizations discovered by the agent:

  • Branchless Accumulator Rescaling: Eliminates warp-synchronization overhead by converting conditional logic to predicated, branchless computation. Measured +8.1% geomean throughput gain on non-causal MHA and +1.6% on causal MHA.
  • Correction/MMA Pipeline Overlap: Enables concurrent execution of correction and matrix multiplication warps, reducing idle time. Provides a +1.1% geomean gain.
  • Register Rebalancing: Dynamically reallocates registers from softmax to correction warps based on critical-path profiling, minimizing spill stalls. Gains reach +2.1% (non-causal).

These optimizations require intricate reasoning about GPU architectural resources (registers, memory barriers, warp scheduling, hardware ISA constraints) and reflect the agent’s ability to synthesize solutions beyond superficial code transformations.
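
To make the idea behind branchless rescaling concrete, the sketch below contrasts a conditional accumulator update with an unconditional one. It illustrates the general technique only and rests on an assumption about how it maps to the paper: the summary does not give the kernel's actual code, and both function names are hypothetical. The key observation is that when the running maximum is unchanged, the scale factor is exactly 1.0, so the multiply can always be performed without a data-dependent branch.

```python
import math
from typing import List, Sequence


def rescale_branchy(acc: Sequence[float], m_old: float, m_new: float) -> List[float]:
    """Rescale only when the running max grew (data-dependent branch)."""
    if m_new > m_old:
        scale = math.exp(m_old - m_new)
        return [a * scale for a in acc]
    return list(acc)


def rescale_branchless(acc: Sequence[float], m_old: float, m_new: float) -> List[float]:
    """Always rescale; assumes m_new >= m_old, as for a running max.

    When m_new == m_old the factor is exp(0.0) == 1.0, so the result
    matches the branchy version without any conditional control flow.
    """
    scale = math.exp(m_old - m_new)
    return [a * scale for a in acc]
```

In Python the two versions differ only stylistically; on a GPU, the unconditional form is the kind of change that can remove divergent control flow and the synchronization that accompanies it.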

Throughput Comparison with Official FA4 Baselines

Figure 7: AVO’s MHA throughput as measured versus official FA4 and cuDNN baselines, confirming consistent gains especially at shorter sequence lengths for causal attention.

The comparison checks the empirical results against more than one benchmarking setup, confirming AVO’s advantage over both internally measured and officially reported baseline numbers.

Implications and Future Research Directions

AVO demonstrates that agentic variation operators enable autonomous discovery of micro-architectural optimizations surpassing hand-engineered kernels, with transferable gains across distinct attention variants. The framework itself is not specific to attention, which suggests applicability to other performance-critical software systems, scientific domains requiring persistent agentic exploration, and other hardware platforms, though the paper evaluates only attention kernels on B200.

The results motivate further investigation in several areas:

  • Population-Level Branching: Extending AVO to archive-based or multi-lineage evolutionary regimes for broader exploration and diversity.
  • Automated Knowledge Base Expansion: Incorporating dynamic domain knowledge acquisition to further enhance agent reasoning.
  • Generalization Beyond Attention Kernels: Targeting other primitives (e.g., convolution, GEMM) and custom hardware architectures for comprehensive autonomous optimization.

Conclusion

AVO constitutes a substantive advance in autonomous optimization, replacing static evolutionary variation with persistent, self-directed agentic reasoning. Empirical results on state-of-the-art attention kernels establish significant throughput gains, rapid transferability, and deep architectural innovation. The theoretical and practical implications point toward scalable, agentic search as a paradigm for complex, hardware-aware software engineering and scientific discovery in AI.

Explain it Like I'm 14

Overview

This paper is about teaching an AI “coding agent” to improve very fast computer programs all by itself. The programs it focuses on power “attention,” a core step in Transformer models (the kind used by many modern AI systems). These programs run on NVIDIA’s latest GPUs (very powerful graphics chips) and are already tuned by experts. The authors show that their AI agent can keep editing, testing, and fixing the code on its own for days and still find ways to make it run faster than the best human-made versions.

Key goals and questions

Here are the main things the authors wanted to find out:

  • Can a self-directed AI agent do more than just suggest code once, and instead act like a smart “variation operator” that plans, edits, tests, and improves code over many steps?
  • Can this agent find real, low-level speedups in highly optimized “attention” kernels on new NVIDIA Blackwell (B200) GPUs?
  • Do the improvements it discovers for one attention style (multi-head attention) carry over to a related style (grouped-query attention) with little extra work?

How they did it (methods, in simple terms)

Think of the agent like a very diligent robot coder in a workshop:

  • It has a toolbox: it can read manuals (GPU documentation), look at old versions of the code, compile and run programs, check correctness, and measure speed.
  • It works in loops: plan a change, make the change, test it, read feedback, and try again. It keeps going for days, building on what worked and dropping what didn’t.
  • It keeps a timeline of “best so far” versions. Only versions that are correct and at least as fast as the current best get saved.
  • If the agent gets stuck or keeps chasing bad ideas, a simple “supervisor” nudges it to try different strategies.

Some terms explained:

  • Attention: a way for AI models to decide which parts of the input are most important to focus on.
  • GPU: a specialized chip that can do tons of math in parallel very quickly.
  • Kernel: a tiny, performance-critical program that runs on the GPU.
  • Throughput/TFLOPS: how much math the program can do per second (higher is faster).

The team ran the agent for 7 days straight on Blackwell B200 GPUs. It repeatedly edited and tested a GPU “attention kernel,” comparing it to top expert-made baselines from cuDNN (NVIDIA’s library) and FlashAttention-4 (a cutting-edge open-source implementation).

What they found

  • The agent discovered faster versions of the attention kernel. In multi-head attention (MHA), it achieved up to 1668 TFLOPS and beat:
    • cuDNN by up to 3.5%
    • FlashAttention-4 by up to 10.5%
  • The improvements weren’t just surface-level. The agent made deep, hardware-aware changes, such as:
    • Removing unnecessary “if” checks so the GPU spends less time waiting (like skipping a stoplight when it’s obviously green every time).
    • Overlapping tasks so different parts of the program work at the same time instead of standing in line (like two cooks working on different steps of a recipe in parallel).
    • Rebalancing how the program uses very fast storage slots called “registers,” so the parts doing the most work get more of them (like giving the busiest teammates more tools).
  • These ideas transferred to grouped-query attention (GQA). With about 30 minutes of extra autonomous editing, the agent produced a GQA version that beat:
    • cuDNN by up to 7.0%
    • FlashAttention-4 by up to 9.3%

Why small percentages matter: When you run these kernels millions or billions of times across big AI systems, a 3–10% speedup saves huge amounts of time and money.

Why it matters

  • It shows AI can be more than a one-shot code writer. As a true “agent,” it can plan, read docs, diagnose problems, and iterate—like a junior engineer that learns by doing and measuring.
  • Even against expert-tuned, state-of-the-art code on the newest hardware, the agent found meaningful improvements. That suggests this approach could help optimize many other performance-critical programs.
  • Faster attention means faster or cheaper AI training and inference, which benefits everything from research to real-world apps.
  • The method is general: the same kind of agent could be used to tune other kernels, libraries, or even different kinds of software and hardware.

In short: the paper introduces a smarter way to search for better code—by making the AI itself the “variation operator” that plans and improves over time. It worked: the agent found real speedups in some of the most optimized AI code out there and transferred its tricks to related tasks with very little extra effort.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains uncertain or unexplored in the paper, framed so that future researchers can act on each item.

  • Population-level evolution is not evaluated: AVO is studied in a single-lineage regime only; there is no empirical comparison to archive-based or island-based population strategies, nor analysis of how branching, diversity maintenance, or archive management would impact search efficiency and solution quality.
  • Absence of controlled ablations isolating “agentic” contributions: The paper does not compare AVO against (a) single-turn LLM generators with identical compute budgets, (b) scripted multi-step LLM workflows, or (c) heuristic mutation operators, leaving the specific gains attributable to autonomy, persistent memory, or tool use unquantified.
  • Limited domain scope (only attention kernels): Claims of generality are not substantiated beyond MHA/GQA on NVIDIA Blackwell; it remains unknown how AVO performs on other kernels (e.g., MLP/activation, layer norm, fused ops, convolutions), other operator classes (sparse, graph), or outside GPU kernel optimization (e.g., compilers, systems).
  • Hardware generalization is untested: Results are only on NVIDIA B200; there is no evaluation on Hopper/Ampere GPUs, AMD GPUs, CPUs, TPUs, or specialized accelerators, nor a study of portability across architectures and driver/CUDA versions.
  • Backward pass and training workloads are omitted: Only forward prefilling throughput is optimized; backward kernels, end-to-end training throughput, and stability under mixed-precision training are not evaluated.
  • Critical inference scenarios are missing: No results for autoregressive decode (KV-cache reuse), variable-length batching, sliding-window/paged attention, cross-attention, or dynamic shapes—all common deployment settings.
  • Narrow configuration space: Benchmarks fix BF16 precision, head dimension 128, and specific sequence lengths totaling 32k tokens; performance for other head dimensions, precisions (FP8/FP16/INT8), larger/smaller contexts, and different batch/sequence mixes is unreported.
  • Limited GQA validation: GQA transfer is demonstrated on two configurations only (group sizes 8 and 4) and for the forward pass; broader coverage (e.g., MQA, other group sizes, backward pass) is not provided.
  • Potential benchmark overfitting: The agent commits only when it improves a fixed benchmark suite, but generalization beyond the chosen configurations is only lightly tested (GQA); there is no held-out shape suite or cross-workload validation to detect over-specialization.
  • No statistical significance analysis: Although averages and standard deviations are mentioned, figures lack error bars and significance tests; speedups near “measurement noise” are not rigorously distinguished from variance.
  • Reproducibility is unclear: The paper does not state whether code, evolved kernels, prompts, knowledge base contents, agent configurations, or seeds will be released; the use of an internal frontier LLM further challenges replication.
  • Compute/energy cost and scalability are unspecified: The number of GPUs/SMs used, parallelism level, energy consumption, and wall-clock to speedup trade-offs are not reported, making it difficult to assess practicality and cost-effectiveness versus expert engineering.
  • Supervisor mechanism is underspecified: The “self-supervision” component that detects stalls and reorients search is not detailed (criteria, algorithms, thresholds), leaving its contribution and reproducibility unknown.
  • Evaluation metrics are single-objective: Optimization targets only TFLOPS; impacts on latency (especially at small batch), memory footprint, peak allocation, power, thermal throttling, and determinism are not measured or jointly optimized.
  • Robustness of correctness checks is not detailed: Tolerances, test distributions, edge cases (extreme values, long sequences), and cross-precision consistency are not described; only forward numerical correctness is enforced, with no formal verification or stress-testing.
  • Stability across toolchain changes is untested: Kernel performance and correctness under different CUDA versions, compilers, drivers, and microcode revisions are not evaluated; resilience to future hardware/driver updates is unknown.
  • Portability risks from micro-architectural tuning: Several optimizations exploit Blackwell-specific behavior (e.g., register budgets, fences); the extent to which these kernels degrade or fail on other architectures is not investigated.
  • Risk of brittle code is not assessed: The maintainability, readability, and long-term supportability of agent-evolved kernels are not analyzed (e.g., complexity, reliance on undefined behavior, sensitivity to small codegen changes).
  • Limited profiler-driven transparency: While Nsight-like profiling is implied, the paper does not disclose full kernel-level metrics (occupancy, SM efficiency, memory throughput, stall reasons) for baseline vs AVO, constraining diagnostic replicability.
  • Fairness of baseline tuning is uncertain: It is unclear whether FA4/cuDNN were exhaustively tuned for the exact shapes, driver versions, and flags used; details on autotuning budgets and environment parity are missing.
  • Seed program dependency and knowledge-base bias: The starting implementation(s) and contents of the domain-specific knowledge base (including FA4 code) may strongly steer outcomes; dependence on pre-existing high-quality kernels is not quantified.
  • LLM dependence and model choice are opaque: The specific models, context lengths, function calling/tools, and memory mechanisms are not disclosed; it is unknown whether comparable results are achievable with open-source models.
  • Failure mode characterization is limited: Beyond qualitative notes on stalling/cycling, the paper lacks quantitative analyses of failure frequency, recovery rates, or the distribution of dead-end explorations.
  • No theoretical framing or guarantees: The agentic operator is described operationally without convergence analysis, sample-efficiency theory, or connections to bandit/MDP formulations that could guide principled improvements.
  • Integration into compilers and autotuners is unexplored: How AVO interplays with or augments systems like TVM/Ansor, Triton, CUTLASS, or vendor autotuners (e.g., seeding schedules, hybrid search) is not evaluated.
  • Multi-objective and constrained optimization is unaddressed: Incorporating constraints (e.g., memory caps, latency targets, determinism, numerical stability) or Pareto-front search is not demonstrated.
  • Multi-kernel and graph-level co-optimization is absent: Potential gains from jointly optimizing interacting kernels (e.g., fused attention + MLP, epilogue fusion) or end-to-end graph schedules are not studied.
  • Scaling beyond days-long runs: The effect of longer/shorter runs, adaptive budget allocation, or parallel agent ensembles on final performance and search diversity is not quantified.
  • Security and safety of autonomous tool use: The agent executes shell commands and edits code; sandboxing, permissioning, and mitigations against destructive actions or data exfiltration are not described.
  • IP/compliance considerations for evolved kernels: Given the use of proprietary docs and code references, the legal and licensing status of agent-generated kernels and their redistributability is not discussed.

Practical Applications

Immediate Applications

Below are deployable use cases that can be adopted with today’s tools and hardware, grounded directly in the paper’s results on NVIDIA Blackwell (B200) GPUs and the AVO workflow.

  • Bold-line production speedups for Transformer inference (prefill path) — Sectors: software, AI infrastructure, cloud — Tools/products/workflows: integrate the AVO-evolved MHA/GQA kernels into TensorRT-LLM, PyTorch extensions, or custom inference servers to shave 3–10%+ latency/cost on B200 for BF16 attention; create CI jobs that swap in AVO kernels for specific shapes (e.g., head dim 128, 16 heads) where they outperform cuDNN/FA4 — Assumptions/dependencies: access to B200 GPUs; correctness harnesses and numeric tolerance checks; guardrails for inline PTX; workloads match or are close to the benchmarked regimes (forward prefilling, BF16)
  • Rapid adaptation of attention variants (e.g., GQA) via agent-led retargeting — Sectors: software, AI infrastructure — Tools/products/workflows: “AVO adaptation jobs” that auto-port an optimized MHA kernel to GQA or minor model variants during model rollout; workflow triggered after model shape changes — Assumptions/dependencies: domain-specific knowledge base (KB) describing the variant; reproducible benchmarking scripts; agent sandboxing
  • Continuous performance regression triage and auto-remediation — Sectors: software, DevOps/MLOps — Tools/products/workflows: add an AVO-based “perf guard” stage in CI/CD that (a) runs Nsight Compute/FA4 benchmarks, (b) detects regressions, (c) launches a bounded agentic optimization loop to recover or beat baseline — Assumptions/dependencies: stable perf baselines; profiler access on dedicated runners; strict time/compute quotas; automatic rollback
  • Targeted micro-architectural optimization campaigns for hot kernels — Sectors: software, HPC, AI systems — Tools/products/workflows: run 48–168 hour AVO sprints focused on specific ops (e.g., KV-cache update, RMSNorm, rotary embeddings, small GEMMs, fused epilogues), harvesting agent-discovered tactics (branchless rescaling, pipeline overlap, register rebalancing) into CUTLASS/Triton templates — Assumptions/dependencies: high-quality test oracles; hardware counters; expert review for safety and maintainability
  • Compiler/autotuner augmentation (TVM/Ansor, Triton, CUTLASS) — Sectors: software, compilers — Tools/products/workflows: plug AVO in as a learned “variation backend” to explore schedule/codegen changes beyond parameter sweeps; export agent commits as reproducible schedule recipes — Assumptions/dependencies: integration APIs; deterministic builds; search budgets; legal clarity for generated code reuse
  • Vendor and library feedback loops (kernel template hardening) — Sectors: semiconductors, systems software — Tools/products/workflows: mine AVO trajectories to extract portable heuristics (e.g., branchless correction paths, warp-group register budgets) and upstream them into cuDNN/CUTLASS/TensorRT templates — Assumptions/dependencies: access to internal/perf docs; IP review; reproducibility across driver versions
  • Academic teaching labs for GPU optimization — Sectors: education, academia — Tools/products/workflows: course modules where students inspect agent trajectories (commits, profiler traces, design notes) to learn warp specialization, fencing, and scheduling; “agent vs. human” lab assignments — Assumptions/dependencies: sanitized datasets of agent logs; affordable GPU time; simplified harnesses
  • Open benchmarking and reproducible research baselines — Sectors: academia, open-source — Tools/products/workflows: release “AVO-bench for kernels” with standardized harnesses (like FA4’s scripts), versioned seeds, and scoring; enable fair comparison of agentic search methods — Assumptions/dependencies: permissive licensing for seeds/harnesses; hosted artifacts and runner scripts
  • Datacenter energy and cost efficiency tuning — Sectors: energy, cloud ops, policy within organizations — Tools/products/workflows: treat AVO optimizations as a “software efficiency credit”; incorporate into fleet planning to reduce energy/cost per token served — Assumptions/dependencies: the organization’s workload mix includes B200 attention-heavy inference; metering that ties micro-speedups to energy KPIs
  • Developer-quality-of-life via pre-optimized open kernels — Sectors: daily life (developers), open-source — Tools/products/workflows: ship prebuilt wheels/whls with AVO kernels for popular shapes; toggle via an environment flag — Assumptions/dependencies: compatibility with user drivers/CUDA; fallbacks to vendor kernels; shape guards
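
Several items above mention swapping in evolved kernels only for specific shapes, toggling them via an environment flag, and falling back to vendor kernels otherwise. A minimal sketch of such a shape-guarded dispatcher follows. Everything here is hypothetical: the flag name `USE_EVOLVED_ATTENTION`, the shape key fields, and `make_attention_dispatcher` are illustrative and do not come from the paper or any existing library.

```python
import os
from typing import Callable, Dict, Tuple

# (head_dim, num_heads, dtype); which fields to key on is an assumption.
ShapeKey = Tuple[int, int, str]
Kernel = Callable[..., object]


def make_attention_dispatcher(
    tuned: Dict[ShapeKey, Kernel],
    vendor: Kernel,
    flag: str = "USE_EVOLVED_ATTENTION",  # hypothetical flag name
) -> Callable[..., object]:
    """Return a callable that picks a tuned kernel only for known shapes."""

    def dispatch(shape: ShapeKey, *args, **kwargs):
        enabled = os.environ.get(flag, "0") == "1"  # opt-in toggle
        kernel = tuned.get(shape) if enabled else None
        if kernel is None:
            kernel = vendor  # shape guard failed or flag is off
        return kernel(*args, **kwargs)

    return dispatch
```

Reading the flag on every call keeps the toggle live at runtime; a production version would more likely read it once and also check driver and CUDA compatibility before enabling a tuned kernel.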

Long-Term Applications

Below are higher-impact directions that need further research, scaling, or engineering before broad deployment.

  • End-to-end “self-driving performance engineer” for the ML stack — Sectors: software, AI infrastructure — Tools/products/workflows: an always-on agent that tunes attention, matmul fusions, data loaders, NCCL collectives, and caching strategies across training and inference stacks — Assumptions/dependencies: richer KBs spanning compilers, kernels, runtimes, distributed systems; robust multi-objective scoring (throughput, memory, accuracy, cost)
  • Cross-architecture retargeting (AMD ROCm, Apple/Qualcomm NPUs, TPUs, custom ASICs) — Sectors: semiconductors, edge, mobile — Tools/products/workflows: AVO variants that learn ISA-, memory-, and scheduler-specific playbooks for each backend; auto-port attention optimizations across vendors — Assumptions/dependencies: access to toolchains/profilers/docs; vendor cooperation; different precision and memory models; retooled test oracles
  • Population-level agentic evolution and design-space exploration — Sectors: software, research platforms — Tools/products/workflows: extend single-lineage AVO to island/archive regimes for broader, parallel exploration with novelty search and automated deduplication — Assumptions/dependencies: orchestration at scale; result merging; diversity metrics; robust failure isolation
  • Automated co-design of algorithms and kernels — Sectors: software, research — Tools/products/workflows: agents that propose algorithmic refactors (e.g., attention tiling strategies, online softmax variants) jointly with kernel schedules to unlock new performance frontiers — Assumptions/dependencies: flexible correctness specs; multi-run ablations; acceptance criteria for algorithmic changes in production models
  • Hardware–software feedback for next-gen ISAs and SM micro-architecture — Sectors: semiconductors — Tools/products/workflows: mine agent-discovered pain points (fences, register budgets, warp roles) to inform future tensor core ops, barriers, or TMA features; “AVO-in-the-loop” micro-arch design — Assumptions/dependencies: simulators, pre-silicon models; IP firewalls; long hardware lead times
  • Verified, safety-certified agentic optimization frameworks — Sectors: policy, compliance, critical infrastructure — Tools/products/workflows: formal guardrails (capability sandboxing, code signing, provenance tracking), conformance test suites, and audit trails for autonomous code changes — Assumptions/dependencies: standards bodies buy-in; third-party certification; reproducible builds; secure runners
  • Generalization to training/backward passes and mixed-precision regimes — Sectors: AI infrastructure — Tools/products/workflows: extend AVO to backward attention, optimizer steps, and different dtypes (FP8/INT8) with calibration-aware correctness — Assumptions/dependencies: numerics-sensitive oracles; stability checks; more complex kernels and dataflows
  • Self-optimizing cloud fleets and per-SKU specialization — Sectors: cloud, operations — Tools/products/workflows: fleet services that auto-learn best kernel variants per GPU SKU and workload pattern; dynamic dispatch guided by online telemetry — Assumptions/dependencies: safe canarying; live A/B; model- and tenant-aware policy; version skew handling
  • Autonomously maintained performance templates for compilers — Sectors: compilers, OSS ecosystems — Tools/products/workflows: agents that continuously refresh Triton/TVM/CUTLASS schedules as drivers/firmware change; PR bots with full ablation reports — Assumptions/dependencies: maintainer trust processes; reproducibility across environments; governance for generated code
  • Application to non-ML HPC domains (CFD, genomics, weather, sparse linear algebra) — Sectors: HPC, scientific computing — Tools/products/workflows: AVO agents specializing on domain kernels (FFT, sparse SpMM, stencil ops), integrating domain-specific correctness suites and physics constraints — Assumptions/dependencies: high-fidelity test harnesses; mixed CPU–GPU pipelines; domain-expert-curated KBs
  • On-device/edge continuous optimization for robotics and AR/VR — Sectors: robotics, edge, consumer devices — Tools/products/workflows: constrained AVO agents that adapt kernels to device thermal envelopes and battery limits; periodic offline re-optimization pushed via updates — Assumptions/dependencies: resource-constrained agents; strict safety limits; opaque vendor drivers
  • Economic and environmental policy levers for software efficiency — Sectors: policy, sustainability — Tools/products/workflows: encourage adoption of autonomous optimization (e.g., procurement guidelines rewarding verified efficiency gains; carbon accounting that credits software-side savings) — Assumptions/dependencies: standardized measurement protocols; third-party verification; interoperability with existing ESG reporting

Notes on Key Dependencies and Assumptions Spanning Many Applications

  • Hardware and toolchain access: Results are demonstrated on NVIDIA B200 with CUDA 13.1; porting requires equivalent profilers (Nsight), compilers, and ISA docs on other platforms.
  • Correctness and safety: High-quality, shape-complete test oracles; numeric stability checks for different masks/precisions; sandboxed execution of generated code.
  • Compute budget and time: Multi-day autonomous runs (e.g., 7 days) require scheduled GPU time and job resilience; production integrations need tight caps and rollback plans.
  • Knowledge base quality: Up-to-date CUDA/PTX/architecture references and baseline sources (e.g., FA4) meaningfully improve agent performance.
  • Licensing/IP: Clarity around reuse of generated code and upstreaming into vendor or open-source libraries.
  • Scope alignment: The paper focuses on forward-pass prefilling attention; extrapolations to training, other ops, or other hardware are promising but require validation.
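
The correctness requirement above can be pictured with a small CPU-side sketch. It is not the paper's test harness: `reference_attention`, `make_inputs`, `check_candidate`, the shape list, and the tolerance are all hypothetical names and values chosen for illustration. The idea is only that a candidate is compared against a trusted naive implementation across several shapes and both mask settings before its throughput is considered.

```python
import math
import random

def reference_attention(q, k, v, causal):
    """Naive O(N^2) attention over lists of row vectors; serves as the trusted oracle."""
    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(len(q)):
        # Under a causal mask, query i may only attend to keys 0..i.
        limit = i + 1 if causal else len(k)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(limit)]
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] * v[j][d] for j in range(limit)) / total
                    for d in range(len(v[0]))])
    return out

def make_inputs(n, dim, seed):
    """Deterministic random Q, K, V so that failures are reproducible."""
    rng = random.Random(seed)
    def matrix():
        return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    return matrix(), matrix(), matrix()

def check_candidate(candidate, shapes=((1, 4), (3, 4), (8, 2)), tol=1e-9):
    """Return the (n, dim, causal) configurations where the candidate disagrees."""
    failures = []
    for n, dim in shapes:
        for causal in (False, True):
            q, k, v = make_inputs(n, dim, seed=n * 31 + dim)
            want = reference_attention(q, k, v, causal)
            got = candidate(q, k, v, causal)
            err = max(abs(a - b)
                      for row_w, row_g in zip(want, got)
                      for a, b in zip(row_w, row_g))
            if not err <= tol:  # written this way so a NaN error also counts as failure
                failures.append((n, dim, causal))
    return failures

def ignores_mask(q, k, v, causal):
    """A deliberately buggy candidate that drops the causal mask."""
    return reference_attention(q, k, v, False)
```

Here `check_candidate(reference_attention)` reports no failures, while `check_candidate(ignores_mask)` flags the causal configurations with more than one token. A real oracle for GPU kernels would additionally need looser, precision-aware tolerances (for example for BF16) and sandboxed execution.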

Glossary

  • Agentic Variation Operators (AVO): A family of evolutionary variation operators that replace fixed mutations/crossovers with an autonomous coding agent capable of planning, tool use, and iterative self-improvement. "Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents."
  • Barrier-based signaling: A synchronization mechanism where threads/warps coordinate progress via barriers. "with barrier-based signaling to coordinate handoffs."
  • BF16 precision: A 16-bit floating-point format (bfloat16) commonly used to accelerate deep learning on modern hardware. "achieving up to 1668 TFLOPS at BF16 precision"
  • Blackwell (B200) GPUs: NVIDIA’s Blackwell-generation GPUs (model B200) targeted for advanced AI workloads. "on NVIDIA Blackwell (B200) GPUs."
  • Boltzmann selection: An evolutionary selection strategy that samples candidates probabilistically based on fitness using a Boltzmann distribution. "LoongFlow similarly relies on a MAP-Elites archive with Boltzmann selection for Sample"
  • Branchless accumulator rescaling: An optimization that removes conditional branches when rescaling accumulators, reducing divergence and synchronization overhead. "Branchless accumulator rescaling"
  • Bounded archive: A population archive with limited size that prunes lower-scoring members to maintain capacity. "possibly pruning low-score members to maintain a bounded archive."
  • Causal attention: Attention with a causal mask preventing each token from attending to future tokens. "For causal attention, some K-block iterations are fully masked (no valid attention entries) and others are fully unmasked, leading to different execution paths within the same kernel."
  • Correction warps: Specialized warps responsible for rescaling outputs during online softmax or normalization steps in attention kernels. "Correction warps rescale the output accumulator $O$ when the running maximum changes across K-block iterations (a requirement of the online softmax algorithm)."
  • CUDA: NVIDIA’s parallel computing platform and programming model for GPUs. "each $x_i$ is a CUDA kernel implementation (source code with inline PTX)"
  • cuDNN: NVIDIA’s library of highly optimized primitives for deep learning, including attention kernels. "We compare against two state-of-the-art baselines: (1) cuDNN: NVIDIA's closed-source attention kernel"
  • Domain-specific knowledge base: A curated set of references (docs, specs, code) relevant to a particular technical domain. "a domain-specific knowledge base"
  • Dual Q-stage design: A pipelining strategy where two query tiles are processed concurrently to increase utilization. "these groups operate concurrently across two Q-tiles (a dual Q-stage design), with barrier-based signaling to coordinate handoffs."
  • Epilogue warps: Warps that handle the final stages of a kernel’s computation, often including data movement or writeback. "Load and epilogue warps handle data movement via the Tensor Memory Accelerator (TMA)."
  • Evolutionary search: An optimization paradigm that iteratively improves a population of candidates using variation and selection. "Evolutionary search optimizes over a space of candidates by maintaining a population $\mathcal{P}$ and iteratively expanding it with new solutions"
  • FlashAttention-4 (FA4): A state-of-the-art attention kernel family optimized for modern NVIDIA GPUs. "both FlashAttention-4 (FA4) and cuDNN requiring months of manual optimization on the latest Blackwell architecture."
  • Forward-pass prefilling throughput: The throughput metric (often in TFLOPS) for the forward prefilling phase of attention inference. "Multi-head attention forward-pass prefilling throughput (TFLOPS) on NVIDIA B200 with head dimension 128, 16 heads, and BF16 precision."
  • Geometric mean throughput: A performance aggregation metric that uses the geometric mean across configurations to summarize throughput. "The solid green line tracks the running-best geometric mean throughput across all configurations;"
  • Grouped-query attention (GQA): An attention variant where multiple query heads share a smaller set of key-value heads to reduce memory/compute. "The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation"
  • Inline PTX: Embedding NVIDIA’s PTX assembly directly within CUDA C++ for fine-grained control. "each $x_i$ is a CUDA kernel implementation (source code with inline PTX)"
  • Instruction pipeline scheduling: The arrangement of instructions to optimize overlapping execution stages and minimize stalls on the hardware pipeline. "including register allocation, instruction pipeline scheduling, and workload distribution,"
  • Island-based evolutionary database: A population structure where multiple sub-populations (“islands”) evolve in parallel, exchanging solutions occasionally. "AlphaEvolve maintains an island-based evolutionary database inspired by MAP-Elites"
  • MAP-Elites: An archive-based evolutionary algorithm that maintains diverse high-performing solutions across behavior niches. "inspired by MAP-Elites"
  • Memory-bound: A regime where performance is limited by memory bandwidth/latency rather than compute capacity. "making the operation memory-bound for large sequence lengths $N$."
  • Micro-architectural optimizations: Low-level hardware-aware refinements (e.g., registers, synchronization, scheduling) to improve performance. "micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations"
  • MMA warps: Warps dedicated to matrix multiply-accumulate operations using tensor cores. "MMA warps execute the two core matrix multiplications via Blackwell's tensor core instructions"
  • Non-blocking fence: A memory-ordering primitive that enforces ordering without waiting for completion of prior memory operations. "with a lighter non-blocking fence that merely enforces ordering."
  • Non-causal attention: Attention without causal masking, allowing every token to attend to all others. "On non-causal attention, AVO achieves modest gains at longer sequences"
  • Online softmax: An incremental softmax computation that maintains running statistics to avoid materializing full intermediate results. "applying the online softmax algorithm with a running row-maximum."
  • PTX ISA: NVIDIA’s Parallel Thread Execution instruction set architecture for GPU assembly-level programming. "PTX ISA documentation"
  • PUCT-based selection rule: A tree search policy (Predictor + UCT) guiding which nodes/candidates to expand based on value and exploration. "a PUCT-based selection rule determines which states to expand,"
  • PV GEMM: The matrix multiplication of the softmax probabilities $P$ with $V$ in attention (i.e., $PV$). "and the PV GEMM (multiplying the softmax output $P = \mathrm{softmax}(S)$ by $V$ to accumulate the output $O$)."
  • Q-tiles: Tiled partitions of the query matrix processed in stages to enable pipelining and parallelism. "these groups operate concurrently across two Q-tiles (a dual Q-stage design)"
  • QK GEMM: The matrix multiplication of $Q$ and $K^\top$ to compute attention scores. "the QK GEMM (producing scores $S$)"
  • QK-PV interleaving: An execution strategy that interleaves QK and PV phases to improve pipeline utilization. "the introduction of QK-PV interleaving with bitmask causal masking (version 8)"
  • Register allocation: The assignment of variables to limited hardware registers to minimize spills and maximize throughput. "including register allocation, instruction pipeline scheduling, and workload distribution,"
  • Register rebalancing across warp groups: Adjusting register budgets among warp groups to reduce spills and balance critical paths. "Register Rebalancing Across Warp Groups"
  • Running row-maximum: The current maximum per row maintained during online softmax to ensure numerical stability. "maintaining a running softmax (with running row-maximum and row-sum)"
  • Scoring function: The evaluation function that measures candidate quality (e.g., correctness and performance). "where $\mathbf{f}$ is a scoring function that evaluates each candidate solution."
  • Single-lineage setting: An evolutionary regime that maintains and improves a single chain of solutions rather than a diverse population. "In this paper we study the single-lineage setting to isolate the effect of the operator itself."
  • Softmax warps: Warps dedicated to softmax computation within the attention kernel pipeline. "Softmax warps compute attention weights $P$ from the scores $S$, applying the online softmax algorithm with a running row-maximum."
  • Tensor core instructions: Specialized GPU instructions for matrix operations on tensor cores to accelerate GEMMs. "via Blackwell's tensor core instructions"
  • Tensor Memory Accelerator (TMA): A hardware-assisted data movement engine used to efficiently transfer tensors. "Load and epilogue warps handle data movement via the Tensor Memory Accelerator (TMA)."
  • TFLOPS: Trillions of floating-point operations per second, a throughput metric for compute performance. "producing MHA kernels achieving up to 1668 TFLOPS at BF16 precision"
  • Tiling: Decomposing computation into blocks/tiles to improve locality and reduce memory overhead. "This tiling eliminates the need to store the full score matrix, shifting the bottleneck from memory bandwidth to compute throughput on modern GPUs."
  • Warp divergence: When threads within a warp follow different control paths, reducing efficiency. "the agent also removed warp divergence in the correction path"
  • Warp groups: Subsets of warps within a thread block assigned to different roles for specialization. "different warp groups within a thread block are assigned distinct roles in the attention pipeline."
  • Warp specialization: Assigning distinct functional roles to different warps to pipeline and parallelize kernel stages. "employ warp specialization"
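
Several glossary entries (online softmax, running row-maximum, correction warps, tiling) describe the same underlying recurrence. The following is a minimal scalar sketch of that recurrence for a single query row, written in plain Python rather than CUDA. The function names and the block size are assumptions for illustration, and the code makes no claim about how the paper's kernels are structured.

```python
import math

def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def online_attention_row(scores, values, block_size=2):
    """Blockwise pass that never materializes the full weight vector.

    State carried across K-blocks: running row-maximum, running row-sum,
    and an unnormalized output accumulator.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Correction step: rescale earlier contributions to the new maximum.
        # The factor is computed unconditionally; it equals 1.0 when the
        # maximum did not change and 0.0 on the first block (exp(-inf)).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(block_scores, block_values):
            p = math.exp(s - new_max)
            running_sum += p
            for d in range(dim):
                acc[d] += p * v[d]
        running_max = new_max
    # Normalize once at the end by the running row-sum.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. The blockwise version only needs one block of scores at a time, which is why tiling removes the need to store the full score matrix.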

Open Problems

We found no open problems mentioned in this paper.
