- The paper presents a novel Python package that directly compiles and dispatches neural network graphs on the Apple Neural Engine, bypassing CoreML.
- It employs a lazy tensor graph paradigm, fusing multiple operators into a single ANE program to optimize performance and reduce latency.
- Empirical results show faster inference and lower power draw for models such as ResNet-18 on the ANE relative to GPU and CPU execution.
ANEForge: Pythonic Direct Compilation and Dispatch for the Apple Neural Engine
Overview and Key Contributions
ANEForge introduces a Python package for direct compilation and dispatch to the Apple Neural Engine (ANE), bypassing the abstraction and runtime heuristics of CoreML. This framework exposes the native capabilities of the ANE, providing deterministic model execution on Apple's fixed-function neural accelerator present in all recent Apple devices. ANEForge establishes a pipeline where a lazy tensor graph—composed from a set of 58 fused and 19 native bridge operators—is compiled directly into a single ANE program and dispatched through private, undocumented APIs leveraged internally by Apple's frameworks.
Distinct from alternative approaches that rely on reverse-engineering or proprietary entitlements, ANEForge operates unentitled and without requiring modification of system integrity protections. Its architectural entry point is Apple's internal Espresso runtime (e5rt), which connects to the aned daemon and kernel drivers, ensuring compatibility and program signing through the official ANE stack.
Technical Architecture
ANEForge’s software architecture is organized around a lazy-graph paradigm, where tensor computations are staged and optimized prior to execution. The graph is lowered to the Model Intermediate Language (MIL) consumed by the ANE’s compiler, and both operator sequences and compressed weights are fused into a monolithic binary. The Python frontend utilizes minimal Objective-C++ shims to interface directly with private Apple libraries, dynamically resolving the required symbols at runtime.
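The lazy-graph idea described above can be illustrated with a minimal sketch. It is not ANEForge's API: the `Lazy` class, the `OPS` table, and the `compile_graph` and `run` helpers are hypothetical names, scalars stand in for tensors, and a flat Python list stands in for the single compiled program. The point is only that building the expression performs no arithmetic, and that the whole graph is flattened into one ordered program before anything runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical op table: names and semantics are illustrative only.
OPS: Dict[str, Callable[..., float]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "relu": lambda a: max(a, 0.0),
}

@dataclass(eq=False)
class Lazy:
    """A node in a lazy graph; constructing one performs no arithmetic."""
    op: str
    inputs: Tuple["Lazy", ...] = ()
    value: float = 0.0  # only meaningful for "const" leaves

    def __add__(self, other: "Lazy") -> "Lazy":
        return Lazy("add", (self, other))

    def __mul__(self, other: "Lazy") -> "Lazy":
        return Lazy("mul", (self, other))

    def relu(self) -> "Lazy":
        return Lazy("relu", (self,))

def const(x: float) -> Lazy:
    return Lazy("const", value=float(x))

Instr = Tuple[str, Tuple[int, ...], float]

def compile_graph(root: Lazy) -> List[Instr]:
    """Flatten the graph into one ordered program, reusing shared nodes."""
    program: List[Instr] = []
    slot: Dict[int, int] = {}  # id(node) -> index of its result in the program

    def visit(node: Lazy) -> int:
        if id(node) not in slot:
            args = tuple(visit(i) for i in node.inputs)
            program.append((node.op, args, node.value))
            slot[id(node)] = len(program) - 1
        return slot[id(node)]

    visit(root)
    return program

def run(program: List[Instr]) -> float:
    """Execute the flattened program in order; the last slot is the output."""
    regs: List[float] = []
    for op, args, value in program:
        regs.append(value if op == "const" else OPS[op](*(regs[a] for a in args)))
    return regs[-1]

x = const(3.0)
y = (x * const(-2.0) + const(1.0)).relu() + x  # graph is built, nothing is computed
program = compile_graph(y)
print(len(program), run(program))
```

Because `x` appears twice in the expression but is emitted once, the flattened program has seven instructions rather than eight, which is the kind of sharing a staged graph makes visible before execution.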
The operator surface covers common ML primitives (convolution, normalization, activation, attention, reductions, shape manipulation), each verified against a registry for hardware compatibility. Lowering follows two paths: fused MIL carries the majority of computation, and a graph-cut bridge reaches lower-level operators that public compilers do not expose. The registry is maintained with machine-checked conformance tests, which supports reproducibility and continuous validation.
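One way to picture the two-path lowering is as a partition of an operator sequence into maximal runs that share a path, with a cut wherever the path changes. The sketch below assumes a linear op sequence and an invented registry; the op names and their path assignments are hypothetical and are not taken from the ANEForge operator census.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical registry: op names and path assignments are invented for
# illustration only.
REGISTRY: Dict[str, str] = {
    "conv2d": "mil",
    "layer_norm": "mil",
    "gelu": "mil",
    "softmax": "mil",
    "native_attention": "bridge",
}

def partition(ops: List[str]) -> List[Tuple[str, List[str]]]:
    """Split a linear op sequence into maximal runs that share a lowering path."""
    unknown = [op for op in ops if op not in REGISTRY]
    if unknown:
        # Reject anything the registry has not verified.
        raise ValueError(f"ops missing from registry: {unknown}")
    return [(path, list(run)) for path, run in groupby(ops, key=REGISTRY.__getitem__)]

segments = partition(["conv2d", "gelu", "native_attention", "layer_norm"])
for path, run in segments:
    print(path, run)
```

A real graph is a DAG rather than a list, so an actual cut would have to respect data dependencies; the linear case is used here only to keep the idea visible.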
A cost model leveraging empirical hardware measurements is integrated into the compilation process, enabling aggressive operator fusion and autotuning for latency on the device. The provided reverse-mode autograd framework emits grad-operators, making forward, backward, and optimizer steps executable on the ANE, thus supporting both inference and full on-device training cycles.
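For readers unfamiliar with reverse-mode autograd, the following is a generic scalar sketch of the technique, not the package's implementation. The `Value` class and the learning rate are assumptions made for illustration. Each node records how to pass an upstream gradient to its parents, and `backward` walks the graph in reverse topological order applying the chain rule.

```python
class Value:
    """Scalar node that records how to push gradients to its parents."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = backward_fn  # maps upstream grad -> per-parent grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                for parent, g in zip(node.parents, node.backward_fn(node.grad)):
                    parent.grad += g  # accumulate: a node may feed several consumers

w, x, b = Value(2.0), Value(3.0), Value(1.0)
loss = w * x + b * w  # d(loss)/dw = x + b, d(loss)/dx = w, d(loss)/db = w
loss.backward()

learning_rate = 0.1  # hypothetical value
w_new = w.data - learning_rate * w.grad  # one plain SGD step on w
print(w.grad, x.grad, b.grad, w_new)
```

The forward pass, the backward pass, and the update are all ordinary arithmetic on graph nodes, which is what makes it plausible to stage all three as operators in the same compiled graph.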
Numerical validation is comprehensive: all operators and model outputs are systematically checked against NumPy or framework (PyTorch/Transformers) references on real Apple Silicon hardware. Key empirical results include:
- Dispatch Latency: Minimal overhead, with small fused programs executed in approximately 90 μs (near the ANE’s dispatch floor of 70 μs).
- ResNet-18 Inference: End-to-end forward pass on ANE completes in 0.33 ms, compared to 2.0 ms on the M-series GPU (torch-MPS, float32) and 6.0 ms on the CPU (PyTorch).
- Power Consumption: ANE runtime draws 4.5 W (M5 Pro, macOS 26.5), lower than comparable GPU scenarios.
- Compression: The weight-streaming path supports int8, int4, and sparse encodings; e.g., the weight blob of a 4096×4096 matmul shrinks from 33.6 MB (fp16) to 8.4 MB, a 4× reduction that corresponds to 4 bits per weight (int4) rather than int8.
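The weight-blob sizes in the compression result can be checked with simple arithmetic. The helper below is a back-of-the-envelope sketch that counts only the dense payload in decimal megabytes and ignores quantization scales, zero points, sparsity masks, and container headers, so real blobs may differ slightly.

```python
def blob_megabytes(rows: int, cols: int, bits_per_weight: int) -> float:
    """Dense payload size in decimal megabytes (1 MB = 1e6 bytes)."""
    return rows * cols * bits_per_weight / 8 / 1e6

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {blob_megabytes(4096, 4096, bits):.1f} MB")
```

Under this accounting, 33.6 MB matches 16 bits per weight and 8.4 MB matches 4 bits per weight; 8 bits per weight gives about 16.8 MB.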
Functionally, ResNet-18, Vision Transformers, MiniLM sentence encoders, Stable Diffusion U-Nets, and variational autoencoders are validated for bitwise-identical or near-identical agreement with reference implementations. On-chip training reaches a CIFAR-10 test accuracy only 1.7 points below PyTorch, and computed gradients attain a cosine similarity of 1.0 against the reference gradients.
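A cosine-similarity check of the kind reported for gradients can be sketched as follows. The gradient values are made up, and the half-precision copy is produced with Python's `struct` module rather than on ANE hardware, so this only illustrates the metric.

```python
import math
import struct
from typing import List, Sequence

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up reference gradient and its half-precision rounding.
reference: List[float] = [0.1234567, -2.5, 3.14159, 1e-3]
half = [to_fp16(g) for g in reference]

print(cosine_similarity(reference, half))
# Cosine similarity ignores scale: a gradient that is uniformly twice as
# large still scores (numerically) one.
print(cosine_similarity(reference, [2.0 * g for g in reference]))
```

Since the metric measures direction only, a score of 1.0 is best read alongside a magnitude check such as a relative norm error.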
The attention operator is directly mapped to the ANE’s native fused attention layer, achieving throughput improvements of 3.7–5.3× over non-fused decompositions. The autograd architecture incorporates numerically stable, half-precision gradient propagation.
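For context, a non-fused decomposition of scaled dot-product attention is a chain of separate stages: a score matmul, a scale, a row-wise softmax, and a second matmul. The pure-Python reference below sketches that chain on nested lists. It is the standard textbook formulation and makes no claim about how the ANE's native layer or ANEForge's lowering is implemented.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]

def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_decomposed(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(q @ k^T / sqrt(d)) @ v, written as four separate stages."""
    d = len(q[0])
    scores = matmul(q, transpose(k))                             # stage 1: matmul
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]  # stage 2: scale
    weights = softmax_rows(scaled)                                # stage 3: softmax
    return matmul(weights, v)                                     # stage 4: matmul

q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 4.0]]
print(attention_decomposed(q, k, v))
```

Each stage here materializes an intermediate matrix; a fused operator avoids writing those intermediates out between stages.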
Functional Scope and Research Implications
ANEForge unlocks research and practical directions beyond inference. It enables:
- Precise Hardware Placement: Models are deterministically run on the ANE, eliminating fallback behavior that can obfuscate power/latency studies.
- Operator Census and Characterization: Enables exhaustive, machine-checked compatibility testing—critical for hardware architectural studies and compiler reverse engineering.
- On-Device Training and Personalization: Maintains optimizer and KV cache state resident on-chip, allowing sustained, privacy-preserving training scenarios.
- Compression and Mixed-Precision Exploration: Compressed weight encoding enables memory-constrained deployment, and fixed-iteration, static-dataflow numerical methods (e.g., Krylov solvers, FFTs) compile and execute efficiently on ANE hardware.
- Open Scientific Computation: The framework demonstrates that dense numerical linear algebra with bounded iteration, as well as spectral methods, are feasible on ANE within the constraints of half precision.
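To make "fixed-iteration, static-dataflow" concrete, here is a sketch of a Krylov method, conjugate gradient, written with a fixed loop count and no convergence test, so the same sequence of operations runs regardless of the data. It uses Python floats on nested lists rather than half precision or ANE operators, and the `tiny` guard is an assumption added to keep the divisions defined once the residual reaches zero.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def matvec(a: Matrix, x: Vector) -> Vector:
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))

def cg_fixed(a: Matrix, b: Vector, iterations: int, tiny: float = 1e-30) -> Vector:
    """Conjugate gradient for symmetric positive-definite `a`, fixed trip count."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - a @ x, with x starting at zero
    p = list(r)  # search direction
    rs = dot(r, r)
    for _ in range(iterations):  # trip count is known before execution
        ap = matvec(a, p)
        alpha = rs / max(dot(p, ap), tiny)  # max() replaces a data-dependent branch
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / max(rs, tiny)
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = cg_fixed(a, b, iterations=2)
print(solution)
```

For this 2×2 system the exact answer is (1/11, 7/11), which two iterations reach up to rounding. In half precision the achievable accuracy would be far lower, consistent with the fp16 constraint noted in the text.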
The modular and extensible Pythonic interface, coupled with minimal direct dependencies (only NumPy), invites further community-driven exploration—be it in extending the operator registry, protocol measurement, or hardware-constrained deployment studies.
Limitations and Future Work
The primary limitation is the reliance on private, undocumented Apple symbols, which impose no API contract; this creates a dependency on reverse-engineered or observed internal interfaces that may break across OS or firmware updates. Each release must be validated against concrete OS/compiler versions and tested via a comprehensive corpus gate. Distribution through official application channels is precluded due to linkage with private frameworks.
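One simple form a version gate could take is an allowlist that is checked before any private symbol is resolved. The sketch below is an assumption about how such a gate might look, not the package's actual mechanism; the `VALIDATED` set and the function names are hypothetical, and the listed versions are placeholders.

```python
from typing import FrozenSet, Tuple

Version = Tuple[int, int]

# Hypothetical allowlist of (major, minor) OS versions that passed validation.
VALIDATED: FrozenSet[Version] = frozenset({(26, 4), (26, 5)})

def parse_version(text: str) -> Version:
    """Parse 'major.minor[.patch]' into (major, minor); patch is ignored."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor

def is_validated(os_version: str, validated: FrozenSet[Version] = VALIDATED) -> bool:
    return parse_version(os_version) in validated

def require_validated(os_version: str) -> None:
    """Fail loudly instead of calling into unverified private interfaces."""
    if not is_validated(os_version):
        raise RuntimeError(
            f"OS version {os_version!r} is not in the validated set; "
            "private interfaces may have changed"
        )

print(is_validated("26.5"), is_validated("27.0"))
```

On a Mac, the version string could come from the standard library's `platform.mac_ver()[0]`; it is an empty string on other platforms, which a caller would need to handle before parsing.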
Numerical computation remains constrained to fp16 throughout the ANE’s datapath—limiting certain low-signal scientific workloads (e.g., classifier-free diffusion guidance) and precluding a fully generic scientific computing backend. The hardware supports only a single-program dispatch lane, and thus all parallelism is extracted through graph fusion, not concurrent execution.
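The practical effect of an fp16 datapath on small signals can be seen by rounding values to half precision on the host. This sketch uses Python's `struct` module (format code `e`) to emulate the rounding and involves no ANE hardware; the specific numbers are chosen for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

correction = 0.0004  # a small update, e.g. a difference between two similar terms

print(to_fp16(correction))        # representable on its own
print(to_fp16(1.0 + correction))  # absorbed when added to a value near 1.0
print(to_fp16(2049.0))            # integers above 2048 are not all representable
```

Near 1.0 the spacing between adjacent half-precision values is 2^-10 (about 0.00098), so any correction smaller than half that spacing vanishes when accumulated into a value of that size.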
Interoperability is limited to Apple Silicon hardware within a known compatibility window, with no support for x86 macOS or Linux (except through Asahi’s community drivers at a much lower abstraction).
Conclusion
ANEForge provides the first Python package for deterministic, direct compilation and execution of computational graphs on the Apple Neural Engine, circumventing CoreML’s nondeterminism and abstraction. The package supports a wide range of neural operators, hardware-specific fused paths (notably attention), compressed and sparse weights, stateful on-chip execution for both inference and training, and dispatch close to the hardware’s latency floor. The thorough operator census, open measurement tools, and validation corpus establish ANEForge as a foundation both for applied research (benchmarking, characterization, mixed-precision studies) and for the exploration of on-device, low-power machine learning and scientific computation under real-world hardware constraints.