
ANEForge: Python for direct computation on the Apple Neural Engine

Published 12 Jun 2026 in cs.PL, cs.AI, and cs.MS | (2606.17090v1)

Abstract: ANEForge is a Python package that programs the Apple Neural Engine (ANE), the fixed-function neural accelerator on every recent Apple device, directly and without CoreML. In production the engine is reachable only through CoreML, which treats it as a scheduling option: no configuration requires the ANE, and a model can silently run on the CPU or GPU instead. ANEForge compiles a lazy tensor graph, built from 58 fused operators and 19 native bridge operators, into a single ANE program. The program is dispatched through the same ANE daemon and kernel-driver stack as Apple's internal framework. Beyond inference, the package reaches the engine's native fused attention, streams int8, int4, and sparse weights, keeps decoder and optimizer state resident across steps, and runs the forward pass, backward pass, and optimizer update of training on the engine. A small fused program completes a call in about 90us, near the engine's 70us per-program dispatch floor, and a pretrained ResNet-18 forward runs end-to-end in 0.33ms. ResNet-18, a sentence encoder, and a Vision Transformer run end-to-end against framework references, and a Stable Diffusion U-Net validates its forward pass. ANEForge targets Apple Silicon under macOS 14 and later. Each release is verified against a recorded macOS and ANE-compiler version.

Authors (1)

Summary

  • The paper presents a novel Python package that directly compiles and dispatches neural network graphs on the Apple Neural Engine, bypassing CoreML.
  • It employs a lazy tensor graph paradigm, fusing multiple operators into a single ANE program to optimize performance and reduce latency.
  • Reported results include a 0.33 ms ResNet-18 forward pass on the ANE, compared with 2.0 ms on the GPU and 6.0 ms on the CPU, at a measured power draw of 4.5 W.

ANEForge: Pythonic Direct Compilation and Dispatch for the Apple Neural Engine

Overview and Key Contributions

ANEForge introduces a Python package for direct compilation and dispatch to the Apple Neural Engine (ANE), bypassing the abstraction and runtime heuristics of CoreML. This framework exposes the native capabilities of the ANE, providing deterministic model execution on Apple's fixed-function neural accelerator present in all recent Apple devices. ANEForge establishes a pipeline where a lazy tensor graph—composed from a set of 58 fused and 19 native bridge operators—is compiled directly into a single ANE program and dispatched through private, undocumented APIs leveraged internally by Apple's frameworks.

Distinct from alternative approaches that rely on reverse-engineering or proprietary entitlements, ANEForge operates unentitled and without requiring modification of system integrity protections. Its architectural entry point is Apple's internal Espresso runtime (e5rt), which connects to the aned daemon and kernel drivers, ensuring compatibility and program signing through the official ANE stack.

Technical Architecture

ANEForge’s software architecture is organized around a lazy-graph paradigm, where tensor computations are staged and optimized prior to execution. The graph is lowered to the Model Intermediate Language (MIL) consumed by the ANE’s compiler, and both operator sequences and compressed weights are fused into a monolithic binary. The Python frontend utilizes minimal Objective-C++ shims to interface directly with private Apple libraries, dynamically resolving the required symbols at runtime.
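The lazy-graph idea can be illustrated with a minimal sketch. The `LazyTensor` class, `const`, `schedule`, and `realize` below are hypothetical names chosen for this illustration, not ANEForge's API, and NumPy evaluation stands in for the compile-and-dispatch step that the real package performs through MIL.

```python
import numpy as np


class LazyTensor:
    """A node in a deferred computation graph (illustrative, not ANEForge's class)."""

    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

    def __add__(self, other):
        return LazyTensor("add", (self, other))

    def __matmul__(self, other):
        return LazyTensor("matmul", (self, other))

    def relu(self):
        return LazyTensor("relu", (self,))


def const(array):
    # Leaves hold data in half precision, matching the fp16 datapath described in the text.
    return LazyTensor("const", value=np.asarray(array, dtype=np.float16))


def schedule(root):
    """Return graph nodes in topological order (inputs before their users)."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.inputs:
            visit(parent)
        order.append(node)

    visit(root)
    return order


KERNELS = {
    "add": lambda a, b: a + b,
    "matmul": lambda a, b: a @ b,
    "relu": lambda a: np.maximum(a, 0),
}


def realize(root):
    """Evaluate the staged graph with NumPy, standing in for compile + dispatch."""
    results = {}
    for node in schedule(root):
        if node.op == "const":
            results[id(node)] = node.value
        else:
            args = [results[id(p)] for p in node.inputs]
            results[id(node)] = KERNELS[node.op](*args)
    return results[id(root)]


x = const([[1.0, -2.0]])
w = const([[0.5, 1.0], [1.0, 0.25]])
b = const([[0.0, 1.0]])
y = (x @ w + b).relu()               # builds graph nodes only; nothing is computed yet
ops = [node.op for node in schedule(y)]
out = realize(y)                     # the whole graph is evaluated in one pass
```

Because the full graph is known before anything runs, a compiler in this position can inspect `ops` as a whole and decide what to fuse, which is the property the lazy design relies on.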

The operator surface includes common ML-primitives (convolution, normalization, activation, attention, reductions, shape manipulation), and is verified against a registry for hardware compatibility. A two-path lowering exists: fused MIL for the majority of computation, with a graph-cut bridge to access lower-level operators otherwise unreachable via public compilers. This registry is maintained with machine-checked conformance tests, providing a layer of reproducibility and continuous validation.
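One way to picture a two-path lowering is to split an operator sequence into maximal runs that share a lowering path. The sketch below assumes a linear operator sequence and a hypothetical registry; the operator names and the `partition` helper are illustrative and are not taken from ANEForge, and how the package stitches the resulting segments together is not described here.

```python
from itertools import groupby

# Hypothetical registry mapping operator name -> lowering path.
# These entries are illustrative, not ANEForge's actual operator set.
REGISTRY = {
    "conv2d": "fused",
    "batch_norm": "fused",
    "relu": "fused",
    "matmul": "fused",
    "softmax": "fused",
    "native_attention": "bridge",
}


def partition(ops):
    """Split an op sequence into maximal contiguous runs sharing a lowering path."""
    unknown = [op for op in ops if op not in REGISTRY]
    if unknown:
        # Refuse to lower anything the registry has not vetted.
        raise KeyError(f"operators missing from registry: {unknown}")
    return [(path, list(run)) for path, run in groupby(ops, key=REGISTRY.get)]


segments = partition(
    ["conv2d", "batch_norm", "relu", "native_attention", "matmul", "softmax"]
)
```

Here each change of path marks a cut point in the graph. Rejecting unregistered operators up front mirrors the role the text assigns to the registry: only operators that have passed conformance checks are lowered.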

A cost model leveraging empirical hardware measurements is integrated into the compilation process, enabling aggressive operator fusion and autotuning for latency on the device. The provided reverse-mode autograd framework emits grad-operators, making forward, backward, and optimizer steps executable on the ANE, thus supporting both inference and full on-device training cycles.
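Reverse-mode autograd followed by an optimizer update can be sketched in a few lines. The `Var` class and helper functions below are hypothetical and run on the CPU with NumPy in float32 for readability; they illustrate the general technique only, not ANEForge's grad-operators or its half-precision gradient path.

```python
import numpy as np


class Var:
    """A value plus a record of how to push gradients back to its parents."""

    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self.parents = parents  # tuple of (parent Var, vector-Jacobian product)


def matmul(a, b):
    return Var(a.data @ b.data,
               ((a, lambda g: g @ b.data.T), (b, lambda g: a.data.T @ g)))


def relu(a):
    mask = a.data > 0
    return Var(a.data * mask, ((a, lambda g: g * mask),))


def total(a):
    return Var(a.data.sum(), ((a, lambda g: g * np.ones_like(a.data)),))


def backward(out):
    """Visit nodes in reverse topological order, accumulating gradients."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)

    visit(out)
    out.grad = np.ones_like(out.data)
    for v in reversed(order):
        for parent, vjp in v.parents:
            parent.grad = parent.grad + vjp(v.grad)


def sgd_step(params, lr):
    """Plain gradient descent update; gradients are cleared afterwards."""
    for p in params:
        p.data = p.data - lr * p.grad
        p.grad = np.zeros_like(p.data)


x = Var([[1.0, 2.0]])
w = Var([[1.0, -1.0], [0.5, 2.0]])
loss = total(relu(matmul(x, w)))     # forward pass
backward(loss)                       # backward pass
grad_w = w.grad.copy()
grad_x = x.grad.copy()
sgd_step([w], lr=0.1)                # optimizer update
```

The three stages at the bottom correspond to the forward pass, backward pass, and optimizer update that the text says are all executed on the engine.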

Numerical Performance and Validation

Numerical validation is comprehensive: all operators and model outputs are systematically checked against NumPy or framework (PyTorch/Transformers) references on real Apple Silicon hardware. Key empirical results include:

  • Dispatch Latency: Minimal overhead, with small fused programs executed in approximately 90 μs (near the ANE’s dispatch floor of 70 μs).
  • ResNet-18 Inference: End-to-end forward pass on ANE completes in 0.33 ms, compared to 2.0 ms on the M-series GPU (torch-MPS, float32) and 6.0 ms on the CPU (PyTorch).
  • Power Consumption: ANE runtime draws 4.5 W (M5 Pro, macOS 26.5), lower than comparable GPU scenarios.
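  • Compression: The weight-streaming path supports int8, int4, and sparse encodings; e.g., the weight blob of a 4096×4096 matmul shrinks from 33.6 MB (fp16) to 8.4 MB. That 4× reduction corresponds to 4 bits per weight; int8 storage alone would give about 16.8 MB.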
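The weight-blob sizes follow from simple arithmetic, which the sketch below works through. The `weight_bytes` helper is a hypothetical name, and the calculation counts raw weight storage only, ignoring any scales, zero points, or container headers a real encoding would add.

```python
def weight_bytes(rows, cols, bits_per_weight):
    """Raw storage for a dense rows x cols weight matrix, in bytes."""
    return rows * cols * bits_per_weight // 8


# Sizes in decimal megabytes for a 4096 x 4096 weight matrix.
sizes_mb = {
    name: weight_bytes(4096, 4096, bits) / 1e6
    for name, bits in {"fp16": 16, "int8": 8, "int4": 4}.items()
}

for name, size in sizes_mb.items():
    print(f"{name}: {size:.1f} MB")
```

Under these assumptions fp16 comes to about 33.6 MB, int8 to about 16.8 MB, and int4 to about 8.4 MB, so an 8.4 MB blob is a 4× reduction from fp16.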

Functionally, ResNet-18, Vision Transformers, MiniLM sentence encoders, Stable Diffusion U-Nets, and variational autoencoders are validated for bitwise or near-identical correspondence to reference implementations. Training on-chip yields only 1.7 points lower test accuracy (CIFAR-10) compared to PyTorch, and gradient calculations attain cosine similarity of 1.0.
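A cosine-similarity check of the kind mentioned above can be sketched as follows. The `cosine_similarity` helper and the random test vector are assumptions made for illustration; a float32 vector rounded to float16 stands in for a half-precision result from the device, since the summary does not describe the actual test harness.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two arrays, computed in float64."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
reference = rng.standard_normal(1024).astype(np.float32)   # stand-in reference gradient
candidate = reference.astype(np.float16)                   # stand-in half-precision result
score = cosine_similarity(reference, candidate)
```

Rounding to half precision perturbs each element slightly, so a score computed this way is very close to 1 rather than exactly 1; a reported value of 1.0 is best read as rounded to the displayed precision. Cosine similarity also ignores overall scale, so it is usually paired with an elementwise tolerance check.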

The attention operator is directly mapped to the ANE’s native fused attention layer, achieving throughput improvements of 3.7–5.3× over non-fused decompositions. The autograd architecture incorporates numerically stable, half-precision gradient propagation.
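For context, a non-fused decomposition of attention consists of a matmul, a scale, a softmax, and a second matmul. The NumPy sketch below is a CPU reference for that decomposition under the standard scaled dot-product definition; the function name is hypothetical, and nothing here reflects how the ANE's fused layer is implemented.

```python
import numpy as np


def attention_reference(q, k, v):
    """Decomposed scaled dot-product attention: matmul, scale, softmax, matmul."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilize the exponential
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# With all-zero keys every score is equal, so each query attends uniformly
# and the output is the mean of the value rows.
q = np.ones((2, 4))
k = np.zeros((3, 4))
v = np.array([[0.0, 3.0], [3.0, 0.0], [3.0, 3.0]])
out, weights = attention_reference(q, k, v)
```

Each of the four steps produces an intermediate tensor in the decomposed form, which is the traffic a fused operator can avoid.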

Functional Scope and Research Implications

ANEForge unlocks research and practical directions beyond inference. It enables:

  • Precise Hardware Placement: Models are deterministically run on the ANE, eliminating fallback behavior that can obfuscate power/latency studies.
  • Operator Census and Characterization: Enables exhaustive, machine-checked compatibility testing—critical for hardware architectural studies and compiler reverse engineering.
  • On-Device Training and Personalization: Maintains optimizer and KV cache state resident on-chip, allowing sustained, privacy-preserving training scenarios.
  • Compression and Mixed-Precision Exploration: Compressed weight encoding enables memory-constrained deployment. Fixed-iteration, static-dataflow numerical methods (e.g., Krylov solvers, FFTs) also compile and execute efficiently on ANE hardware.
  • Open Scientific Computation: The framework demonstrates that dense numerical linear algebra with bounded iteration, as well as spectral methods, are feasible on ANE within the constraints of half precision.
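A fixed-iteration Krylov method can be sketched as conjugate gradient with a trip count chosen ahead of time and no convergence test inside the loop, so the computation unrolls into a static graph. The function below is a hypothetical NumPy illustration run in float32 on the CPU, not code from ANEForge; the small test matrix is an arbitrary symmetric positive-definite example.

```python
import numpy as np


def conjugate_gradient_fixed(A, b, iterations, dtype=np.float32):
    """Run exactly `iterations` CG steps with no data-dependent control flow."""
    A = np.asarray(A, dtype=dtype)
    b = np.asarray(b, dtype=dtype)
    tiny = np.finfo(dtype).tiny          # guards against 0/0 once the residual vanishes
    x = np.zeros_like(b)
    r = b.copy()                         # residual b - A @ x for the initial x = 0
    p = r.copy()                         # first search direction
    rs = r @ r
    for _ in range(iterations):          # trip count is known before execution
        Ap = A @ p
        alpha = rs / (p @ Ap + tiny)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / (rs + tiny)) * p
        rs = rs_new
    return x


A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 3.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = conjugate_gradient_fixed(A, b, iterations=4)
expected = np.linalg.solve(np.array(A, dtype=np.float64), np.array(b, dtype=np.float64))
```

The usual early exit on a residual threshold is replaced by the `tiny` guard, which keeps the arithmetic well defined if the residual reaches zero before the loop ends. Passing `dtype=np.float16` runs the same dataflow in half precision, where accuracy will be correspondingly lower.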

The modular and extensible Pythonic interface, coupled with minimal direct dependencies (only NumPy), invites further community-driven exploration—be it in extending the operator registry, protocol measurement, or hardware-constrained deployment studies.

Limitations and Future Work

The primary limitation is the reliance on private, undocumented Apple symbols, which impose no API contract; this creates a dependency on reverse-engineered or observed internal interfaces that may break across OS or firmware updates. Each release must be validated against concrete OS/compiler versions and tested via a comprehensive corpus gate. Distribution through official application channels is precluded due to linkage with private frameworks.
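A version gate of the kind this implies can be sketched with the standard library. The `VERIFIED` set, the `host_status` function, and the specific version numbers below are hypothetical placeholders for illustration, not the configurations ANEForge actually records; only the macOS 14 minimum and the Apple Silicon requirement come from the text.

```python
import platform

# Hypothetical record of (major, minor) macOS versions a release was checked against.
VERIFIED = {(14, 5), (15, 1)}


def parse_version(text):
    """Turn '14.5' or '13.6.1' into a tuple of ints; empty tuple if not parseable."""
    parts = [int(part) for part in text.split(".") if part.isdigit()]
    if not parts:
        return ()
    return tuple(parts + [0] * (2 - len(parts)))   # pad '14' to (14, 0)


def host_status(mac_version, machine, verified=VERIFIED, minimum=(14, 0)):
    """Classify a host as verified, unverified, or unsupported."""
    if machine != "arm64":
        return "unsupported: not Apple Silicon"
    version = parse_version(mac_version)
    if not version or version < minimum:
        return "unsupported: macOS too old or not macOS"
    if version[:2] in verified:
        return "verified"
    return "unverified: private symbols may have changed"


# On a real host the inputs would come from the platform module.
current = host_status(platform.mac_ver()[0], platform.machine())
```

Treating an unknown but recent version as "unverified" rather than "unsupported" lets a caller decide whether to warn or refuse, which matters when the underlying symbols carry no API contract.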

Numerical computation remains constrained to fp16 throughout the ANE’s datapath. This limits certain low-signal workloads, where the quantity of interest is small relative to the values being combined (e.g., classifier-free diffusion guidance), and precludes a fully generic scientific computing backend. The hardware supports only a single-program dispatch lane, so all parallelism is extracted through graph fusion rather than concurrent execution.
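The fp16 constraint can be made concrete with a few NumPy checks run on the CPU. The specific input values below are chosen for illustration only and do not come from the paper; they show range, absorption, and cancellation effects that any IEEE half-precision datapath shares.

```python
import numpy as np

info = np.finfo(np.float16)
largest = float(info.max)              # largest finite half-precision value
spacing_near_one = float(info.eps)     # gap between 1.0 and the next representable value

# Absorption: above 2048 the spacing between representable values is 2,
# so adding 1 is lost to rounding.
absorbed = np.float16(2048) + np.float16(1)

# Overflow: results beyond the largest finite value become infinity.
with np.errstate(over="ignore"):
    overflowed = np.float16(60000) * np.float16(2)

# Cancellation: subtracting two nearby values, as in a guidance-style
# expression uncond + s * (cond - uncond), exposes the rounding of each input.
cond, uncond = 1000.3, 1000.1
delta32 = np.float32(cond) - np.float32(uncond)
delta16 = np.float16(cond) - np.float16(uncond)
```

Near 1000 the half-precision grid has a spacing of 0.5, so the two inputs round to 1000.5 and 1000.0 and their difference comes out as 0.5 instead of roughly 0.2. Any scale factor applied afterwards multiplies that error.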

Interoperability is limited to Apple Silicon hardware within a known compatibility window, with no support for x86 macOS or Linux (except through Asahi’s community drivers at a much lower abstraction).

Conclusion

ANEForge provides the first Python package for deterministic, direct compilation and execution of computational graphs on the Apple Neural Engine, circumventing CoreML’s nondeterministic device placement and its abstraction layer. The package supports a wide range of neural operators, hardware-specific fused paths (notably attention), compressed and sparse weights, stateful on-chip execution for both inference and training, and dispatch close to the hardware’s latency floor. The operator census, open measurement tools, and validation corpus establish ANEForge as a foundation for applied research (benchmarking, characterization, mixed-precision studies) and for the exploration of on-device, low-power machine learning and scientific computation under real-world hardware constraints.
