BaseRT: Apple Silicon LLM Runtime

Updated 4 July 2026

BaseRT is a native Metal inference runtime for large language models on Apple Silicon, designed to recover performance through chip-specific kernel fusion, unified memory-aware optimization, and custom dispatch logic.
It bypasses higher-level frameworks by directly issuing Metal command buffers, thereby reducing abstraction overhead and significantly improving tokens-per-second throughput on M-series devices.
Supporting a wide range of models and quantisation formats, BaseRT enables real-time, privacy-sensitive on-device inference by efficiently leveraging Apple Silicon’s unified memory and GPU architecture.

BaseRT is a native Metal inference runtime for LLMs on Apple Silicon that reports the highest inference throughput on this hardware to date (Rathnayaka et al., 1 Jul 2026). It is a purpose-built inference engine with a thin C API, targets Apple’s Metal GPU API directly, and is designed to eliminate abstraction overheads attributed to frameworks not built around Metal’s execution model or Apple Silicon’s unified memory topology. The system supports a wide range of model families across eight quantisation formats, from Q2 to FP16, on all Apple M-series devices, and its reported evaluations focus on the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 on M3 and M4 Pro devices (Rathnayaka et al., 1 Jul 2026).

1. Positioning and stated objectives

BaseRT is presented as an alternative to existing runtimes, including llama.cpp and MLX-based frameworks, which are said to incur overhead from abstractions not designed for Metal’s execution model or Apple Silicon’s unified memory topology (Rathnayaka et al., 1 Jul 2026). Its design objective is therefore not generic portability, but recovery of performance “that framework-based approaches leave on the table” through three mechanisms named explicitly in the paper: chip-specific kernel fusion, unified memory-aware optimisation, and custom dispatch logic.

The runtime is situated in the context of local LLM execution on Apple Silicon. The paper frames this as part of an “emerging edge inference paradigm,” in which privacy requirements, latency constraints, and cloud cost pressures push inference toward on-device deployment. Within that framing, BaseRT is described as a critical enabling layer because it attempts to maximize tokens-per-second under the hardware constraints of M-series SoCs rather than through a cross-platform abstraction boundary (Rathnayaka et al., 1 Jul 2026).

The scope of support is broad at the level claimed in the paper: BaseRT supports model families from sub-1B to 30B parameters and quantisation formats from Q2 through FP16. The evaluation set is narrower and concrete: Qwen3-0.6B, Llama 3.2-1B and 3B, Gemma 4-E2B, Gemma 4-26B-A4B, and Qwen3-30B-A3B, all benchmarked at Q4 and Q8 on Apple M3 Base and M4 Pro devices (Rathnayaka et al., 1 Jul 2026).

2. Runtime architecture and execution model

BaseRT is written in C++ and issues Metal command buffers directly, bypassing MLX, CoreML, and other high-level array frameworks (Rathnayaka et al., 1 Jul 2026). At startup, it parses a declarative “architecture descriptor” containing attention norms, Mixture-of-Experts routing rules, RoPE conventions, and activation types into a small table of operator sequences. The stated consequence is that the hot path never branches on model identity. Weight tensors are laid out in GPU-friendly, coalesced buffers in unified memory, and during decode the CPU only patches a few small uniform buffers and enqueues a sequence of pre-compiled Metal compute kernels. The paper states that there is zero per-token allocation or graph construction in this path.
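The descriptor-to-table idea can be illustrated with a minimal Python sketch. It is not BaseRT's implementation: the descriptor keys (`norm`, `activation`), the `KERNELS` registry, and the function names are hypothetical, and plain Python functions stand in for pre-compiled Metal kernels. The point is only that model-specific choices are resolved once at startup, so the per-token loop walks a flat table without branching on model identity.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]

def rms_norm(x: Vector) -> Vector:
    # Scale by the reciprocal root-mean-square (no learned weight in this toy).
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + 1e-6)
    return [v * inv_rms for v in x]

def layer_norm(x: Vector) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + 1e-6)
    return [(v - mean) * inv_std for v in x]

def silu(x: Vector) -> Vector:
    return [v / (1.0 + math.exp(-v)) for v in x]

def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]

# Hypothetical registry: a real runtime would map names to compiled GPU kernels.
KERNELS: Dict[str, Callable[[Vector], Vector]] = {
    "rms_norm": rms_norm,
    "layer_norm": layer_norm,
    "silu": silu,
    "relu": relu,
}

def build_op_table(descriptor: Dict[str, str]) -> List[Callable[[Vector], Vector]]:
    """Resolve descriptor fields to a flat operator sequence, once, at startup."""
    return [KERNELS[descriptor["norm"]], KERNELS[descriptor["activation"]]]

def decode_step(op_table: List[Callable[[Vector], Vector]], x: Vector) -> Vector:
    # Hot path: walk the table; no "if model == ..." checks per token.
    for kernel in op_table:
        x = kernel(x)
    return x

example_descriptor = {"norm": "rms_norm", "activation": "silu"}  # hypothetical
op_table = build_op_table(example_descriptor)
hidden = decode_step(op_table, [1.0, -1.0])
```

Supporting a different architecture in this sketch means supplying a different descriptor; `decode_step` itself does not change.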

Its kernel library is hand-optimized and covers matrix multiplication, with GEMV used for $M=1$ decode and GEMM for $M>1$ prefill, as well as attention, layer-norm, RoPE, activation, and dequantisation. The architectural emphasis is not only on per-kernel tuning but on fusion of common operator chains. The examples given are “matmul→add bias→layernorm→silu” in the feed-forward network and “$Q \cdot K^T \to \mathrm{softmax} \to V$” in attention. Each fusion is said to save one dispatch overhead, denoted $\tau_{\text{launch}}$, and one round-trip to global memory (Rathnayaka et al., 1 Jul 2026).
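As a rough illustration of the claim that each fusion saves one dispatch overhead and one round-trip to global memory, the following Python sketch estimates the time saved when a chain of operators is fused into one kernel. All numbers are assumptions chosen for illustration (a 10 µs launch overhead, 100 GB/s bandwidth, a 4096-element FP16 activation); they are not measurements from the paper, and the model ignores everything except launches and intermediate activation traffic.

```python
def fusion_savings_s(n_ops: int,
                     activation_bytes: int,
                     launch_overhead_s: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Time saved by fusing a chain of n_ops operators into one kernel.

    Each removed kernel boundary saves one launch plus one write and one
    read of the intermediate activation (a round-trip to global memory).
    """
    boundaries = n_ops - 1
    round_trip_s = 2 * activation_bytes / bandwidth_bytes_per_s
    return boundaries * (launch_overhead_s + round_trip_s)

# Hypothetical decode-time example: matmul -> add bias -> layernorm -> silu.
saved = fusion_savings_s(
    n_ops=4,
    activation_bytes=4096 * 2,        # assumed: 4096 FP16 activations
    launch_overhead_s=10e-6,          # assumed launch overhead
    bandwidth_bytes_per_s=100e9,      # assumed memory bandwidth
)
print(f"estimated saving per fused chain: {saved * 1e6:.2f} microseconds")
```

Under these assumed numbers the launch term dominates the activation round-trip by well over an order of magnitude, which is consistent with the source's emphasis on launch overhead during single-token decode.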

The paper summarizes this with a simple roofline model for a fused kernel:

$T_{\text{fused}} \simeq \min \left( \frac{O_{\text{fused}}}{\tau_{\text{launch}}}, \frac{D_{\text{fused}}}{BW_{\text{mem}}} + \frac{O_{\text{fused}}}{\mathrm{Peak\_FLOP}} \right)$

where $O_{\text{fused}}$ is the total FLOPs in the fused chain, $D_{\text{fused}}$ is the total bytes read and written after accounting for in-kernel dequantisation, $BW_{\text{mem}}$ is the unified memory bandwidth, and $\tau_{\text{launch}}$ is the per-dispatch overhead, reported as approximately $10$–[…]. For M4 Pro, the paper gives […] as the example unified memory bandwidth. By keeping $D_{\text{fused}}$ small through on-the-fly dequantisation and $O_{\text{fused}}$ large through fusion, BaseRT is described as shifting decode kernels from a launch-bound regime into a GPU-compute-bound regime, thereby increasing tokens per second (Rathnayaka et al., 1 Jul 2026).

A second architectural feature is custom dispatch logic built around Metal indirect command buffers (ICBs). Rather than issuing one command buffer per operator per token, BaseRT groups multiple token decodes into a single ICB whenever possible. The fused kernel sequence is pre-encoded inside the ICB, allowing back-to-back GPU execution without CPU round-trips, while a small dispatcher thread refills ICBs as kernels complete. The stated purpose is amortization of $\tau_{\text{launch}}$ across many tokens (Rathnayaka et al., 1 Jul 2026).
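The amortization argument can be made concrete with a small counting sketch in Python. The kernel count per token, the group size, and the per-submission cost below are hypothetical values, and the function names are illustrative; the sketch only compares how many CPU-side submissions each strategy needs.

```python
import math
from typing import Optional

def cpu_submissions(tokens: int, kernels_per_token: int,
                    tokens_per_icb: Optional[int] = None) -> int:
    """Count CPU-side submissions for a decode run.

    Without grouping: one command buffer per operator per token.
    With grouping: one pre-encoded buffer per group of tokens.
    """
    if tokens_per_icb is None:
        return tokens * kernels_per_token
    return math.ceil(tokens / tokens_per_icb)

def submit_overhead_per_token(tokens: int, kernels_per_token: int,
                              submit_cost_s: float,
                              tokens_per_icb: Optional[int] = None) -> float:
    """Average CPU submission cost attributed to each generated token."""
    n = cpu_submissions(tokens, kernels_per_token, tokens_per_icb)
    return n * submit_cost_s / tokens

# Hypothetical: 128 tokens, 40 kernels per token, 10 microseconds per submission.
naive = submit_overhead_per_token(128, 40, 10e-6)
grouped = submit_overhead_per_token(128, 40, 10e-6, tokens_per_icb=8)
```

With these assumed values the per-token submission cost drops from 400 µs to 1.25 µs; the real benefit depends on how many tokens can actually be grouped.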

3. Unified memory strategy and quantisation model

BaseRT’s memory model is tightly coupled to Apple Silicon’s shared CPU–GPU physical memory pool. The runtime places weights, KV-cache, activation scratch, and logits buffers in a single MTLHeap, and the paper states that Metal pages only the touched […] blocks to the GPU. When only a subset of parameters or KV slices is active, the claim is that physical memory traffic is limited to what the GPU actually accesses. The implementation description characterizes this as demand-paged VM without host–GPU copies, with metalBufferDidModifyRange calls used only when patching small uniforms (Rathnayaka et al., 1 Jul 2026).

The paper introduces a simple parameter […] for the fraction of the model actively accessed per token, giving […] for small models and […] for large MoE models with sparse routing. The corresponding effective bandwidth model is

[…]

This model is used to explain why decode speedups shrink on very large models: as the actively accessed fraction falls, memory-bandwidth bounds re-emerge (Rathnayaka et al., 1 Jul 2026).

Quantisation support comprises eight formats with bespoke Metal kernels that integrate dequantisation into the inner loop: Q2, Q3, Q4, Q5, Q6, Q8, BF16, and FP16. For each integer format, from the 2-bit Q2 through Q3, Q4, Q5, Q6, and Q8, the paper specifies two format parameters and a representable range ([…]). BF16 is the 16-bit brain-float format, with an 8-bit exponent and a 7-bit mantissa, and FP16 is the IEEE-754 half-precision format, with a 5-bit exponent and a 10-bit mantissa; the paper gives an approximate range for each ([…]) (Rathnayaka et al., 1 Jul 2026).
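To show what integrating dequantisation into the inner loop means, here is a minimal Python sketch of a 4-bit dot product. The block layout is an assumption made for illustration (32 weights per block, one float scale per block, two signed 4-bit values packed per byte with an offset of 8); it is not BaseRT's actual format, and `quantize_q4` and `dot_q4` are hypothetical names. Each weight is expanded to a float only at the moment it is multiplied, so no full-precision copy of the weight vector is ever built.

```python
from typing import List, Tuple

BLOCK = 32  # assumed block size

def quantize_q4(weights: List[float]) -> List[Tuple[float, bytes]]:
    """Quantise floats into (scale, packed_nibbles) blocks of BLOCK values."""
    assert len(weights) % BLOCK == 0
    blocks = []
    for start in range(0, len(weights), BLOCK):
        chunk = weights[start:start + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 7.0 if amax > 0 else 1.0
        # Map to integers in [-8, 7], then shift to [0, 15] for packing.
        q = [max(-8, min(7, round(w / scale))) + 8 for w in chunk]
        packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, BLOCK, 2))
        blocks.append((scale, packed))
    return blocks

def dot_q4(blocks: List[Tuple[float, bytes]], x: List[float]) -> float:
    """Dot product that dequantises each weight just before it is used."""
    acc = 0.0
    for b, (scale, packed) in enumerate(blocks):
        base = b * BLOCK
        for j, byte in enumerate(packed):
            lo = (byte & 0x0F) - 8   # low nibble: even-indexed weight
            hi = (byte >> 4) - 8     # high nibble: odd-indexed weight
            acc += scale * lo * x[base + 2 * j]
            acc += scale * hi * x[base + 2 * j + 1]
    return acc

weights = [((i % 15) - 7) / 7.0 for i in range(64)]
x = [0.5] * 64
approx = dot_q4(quantize_q4(weights), x)
```

In a GPU kernel the same unpack-and-scale step would run in registers inside the GEMV loop; the Python version only mirrors the data flow.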

Peak model memory is given by

$M_{\text{peak}} \approx \frac{N \cdot b}{8}\ \text{bytes}$

for a model with $N$ parameters under $b$-bit quantisation. The paper’s example is a 30B-parameter model in Q4, which occupies approximately $30 \times 10^{9} \times 4 / 8 = 15$ GB. Because BaseRT reads quantised weights directly from unified memory and materializes dequantised scalars only in SIMD registers, the paper states that there is no intermediate b16 or b32 weight buffer, so peak memory remains at approximately the quantised weight footprint plus an allowance for scratch and cache (Rathnayaka et al., 1 Jul 2026).
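The footprint arithmetic can be checked with a short Python sketch. It assumes the simplest possible model, parameters times bits divided by eight, and ignores per-block scales, KV-cache, and scratch buffers, so it is a lower bound rather than the paper's exact accounting; `weight_bytes` is a hypothetical helper name.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store n_params weights at the given bit width.

    Assumes densely packed weights with no per-block metadata.
    """
    return n_params * bits_per_weight // 8

GB = 10**9  # decimal gigabytes

params_30b = 30 * 10**9
for name, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    print(f"30B at {name}: {weight_bytes(params_30b, bits) / GB:.0f} GB")
```

The comparison illustrates why avoiding an intermediate 16-bit weight buffer matters: under this simple model, a Q4 model expanded to FP16 would need four times the weight storage.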

4. Benchmark methodology and reported throughput

The reported evaluation uses two metrics. Decode throughput, denoted tg128, is defined as autoregressive generation of 128 tokens. Prefill throughput, denoted ppL, is defined as prompt processing of $L$ tokens. The benchmark platforms are Apple M3 Base, specified as 8 GPU cores and […], and Apple M4 Pro, specified as 16 cores and […] (Rathnayaka et al., 1 Jul 2026).

On M4 Pro decode, the highest listed BaseRT number is 464.5 tok/s for Qwen3-0.6B Q4, compared with 297.4 tok/s for llama.cpp and 343.6 tok/s for MLX, corresponding to speedups of approximately $1.56\times$ and $1.35\times$ respectively. For Qwen3-0.6B Q8, BaseRT is 321.2 tok/s, compared with 219.8 for llama.cpp and 255.3 for MLX. For Llama3.2-1B Q4, the result is 295.4 tok/s versus 230.4 and 257.8; for Llama3.2-1B Q8, 183.8 versus 160.7 and 159.2. For Llama3.2-3B Q4, BaseRT is 117.3 tok/s; for Q8, 70.9 tok/s. For Gemma4-E2B, the paper reports 127.7 tok/s at Q4 and 84.5 tok/s at Q8, with no MLX entries shown. For Gemma4-26B Q4, BaseRT reports 62.2 tok/s, llama.cpp 58.0 tok/s, and MLX 69.3 tok/s. For Qwen3-30B-A3B Q4, BaseRT reports 84.1 tok/s, llama.cpp 80.7 tok/s, and MLX 83.1 tok/s (Rathnayaka et al., 1 Jul 2026).
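The speedup ratios implied by these figures can be recomputed directly. The Python sketch below uses only the M4 Pro decode configurations for which all three runtimes have a number in the text above; the dictionary layout and function name are illustrative.

```python
from typing import Dict, Tuple

# tg128 decode throughput on M4 Pro in tok/s: (BaseRT, llama.cpp, MLX),
# copied from the figures quoted in the paragraph above.
REPORTED: Dict[str, Tuple[float, float, float]] = {
    "Qwen3-0.6B Q4": (464.5, 297.4, 343.6),
    "Qwen3-0.6B Q8": (321.2, 219.8, 255.3),
    "Llama3.2-1B Q4": (295.4, 230.4, 257.8),
    "Llama3.2-1B Q8": (183.8, 160.7, 159.2),
    "Gemma4-26B Q4": (62.2, 58.0, 69.3),
    "Qwen3-30B-A3B Q4": (84.1, 80.7, 83.1),
}

def speedups(table: Dict[str, Tuple[float, float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return (vs llama.cpp, vs MLX) throughput ratios, rounded to 2 places."""
    return {
        name: (round(base / llama, 2), round(base / mlx, 2))
        for name, (base, llama, mlx) in table.items()
    }

for name, (vs_llama, vs_mlx) in speedups(REPORTED).items():
    print(f"{name:18s} vs llama.cpp {vs_llama:.2f}x   vs MLX {vs_mlx:.2f}x")
```

A ratio below 1.0 in the MLX column marks a configuration where MLX is faster than BaseRT; with these figures that happens only for Gemma4-26B Q4.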

The paper summarizes these decode results by stating that across all six Q4 models, the speedup over llama.cpp ranges from approximately $1.04\times$ for a large MoE model to approximately $1.56\times$ for Qwen3-0.6B, and against MLX BaseRT leads by […]–[…] except on Gemma4-26B Q4, where MLX edges BaseRT (69.3 versus 62.2 tok/s). This point is significant because it qualifies the “best-in-class” framing: the aggregate claim is strong, but the reported decode table contains at least one M4 Pro configuration in which MLX is faster (Rathnayaka et al., 1 Jul 2026).

For prefill on M4 Pro, the detailed evaluation states that at pp128 on Qwen3-30B-A3B Q4, BaseRT achieves 738 tok/s versus 415 tok/s for MLX, a speedup of approximately $1.78\times$. On Gemma4-26B-A4B Q4 at pp128, the paper reports 659 tok/s for BaseRT versus 464 tok/s for MLX, a speedup of approximately $1.42\times$. It also states that dense models from 0.6B to 3B show only […] differences versus llama.cpp and MLX because GEMM kernels saturate the GPU compute units in all runtimes (Rathnayaka et al., 1 Jul 2026).

A reporting nuance appears in the same source. The abstract states “up to […] higher than MLX, with substantially larger margins on prefill for mixture-of-experts models,” while the later “Key Performance Metrics & Recommendation” section reports “Prefill throughput (pp128) on M4 Pro: up to 838 tok/s (Qwen3 30B Q4), […] vs MLX.” The 838 tok/s figure also differs from the 738 tok/s given for the same model and setting in the detailed evaluation. A plausible implication is that different summary points in the paper refer to different slices of the evaluation or to different reported maxima, but the figures are presented without being reconciled (Rathnayaka et al., 1 Jul 2026).

On M3 Base, decode tg128 speedups versus llama.cpp are reported as […]–[…] across Q4 and Q8 for 0.6B–3B models, and versus MLX as […]–[…]. The paper interprets this as confirmation that the design scales consistently across Apple Silicon generations (Rathnayaka et al., 1 Jul 2026).

5. Claimed implications for edge and on-device inference

The paper identifies three drivers for on-device LLM deployment: latency, privacy and resilience, and cost (Rathnayaka et al., 1 Jul 2026). On latency, it states that by avoiding network hops and multi-tenant scheduling, BaseRT on an M4 Pro can produce a first token in under 50 ms for 1B–3B models, enabling sub-100 ms interactive UIs and real-time robotic perception loops. On privacy and resilience, it states that all prompt data and inference state remain in local device memory, and that the absence of a cloud round-trip or shared GPU queue reduces data-exfiltration risk and removes single-point-of-failure cloud outages. On cost, it argues that local inference amortizes hardware CAPEX over infinite inferences, driving per-token cost toward zero, and that high throughput maximizes utilization of existing Apple devices (Rathnayaka et al., 1 Jul 2026).

Within that argument, BaseRT is positioned as shifting the “viability boundary” from approximately 7B models up to approximately 30B parameters at interactive speeds. The paper names several resulting edge use cases: complex multi-agent workflows, offline document summarization, and on-device code completion. It also states that the system “makes light of previous memory-bandwidth limits,” a formulation that should be read as the paper’s interpretation of its throughput results rather than as an independent capacity theorem (Rathnayaka et al., 1 Jul 2026).

The same section also connects BaseRT’s performance profile to Apple Silicon’s role as an inference platform. The central claim is not merely that a native runtime is faster than framework-based alternatives, but that the measured margin materially changes what classes of models remain usable under local latency and memory constraints. This suggests a research agenda centered on hardware-specific inference runtimes as a systems layer for privacy-sensitive and low-latency LLM deployment.

6. Terminological ambiguity and a distinct earlier usage

The label “BaseRT” also appears in a different context: a summary of Garikipati and Shin’s “Scalable Real-time Transport of Baseband Traffic” uses “BaseRT” to denote the real-time baseband transport problem and the DISTRO solution rather than the Apple Silicon LLM runtime (Garikipati et al., 2017). The two uses are unrelated in subject matter, architecture, and performance target.

In that earlier communications-systems setting, the problem concerns transport of periodic baseband packet streams from radios to a backend. Each radio samples at its own rate, quantizes each I/Q component to a given number of bits, and emits one fixed-size payload block per packet, with a period determined by the payload size, the sampling rate, and the quantization width. A flow is schedulable if every packet arrives at the destination within its delay bound. The proposed DISTRO network is a symmetric fat-tree in which radios are leaves, edge switches run a non-preemptive real-time scheduler such as EDF or fixed-priority, aggregation switches use FIFO, uplink capacity is at least the sum of incoming capacities, and all packets have equal size (Garikipati et al., 2017).

The delay analysis bounds the queuing delay at each aggregation switch, which leads to an aggregation-delay bound and hence to an end-to-end delay bound (Garikipati et al., 2017).

The same work couples transport schedulability to wireless capacity and optimizes over quantization vectors, one quantization level per radio, subject to schedulability. The summary describes a BFS-based search that starts from the maximal quantization vector, checks schedulability, computes capacity when schedulable, and otherwise enumerates neighbors obtained by decrementing one component by one level. Theorem 4.3 is said to guarantee optimality because capacity increases with the quantization level while schedulability is monotonic in the opposite direction (Garikipati et al., 2017).
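The search procedure described above can be sketched in Python. This is an illustration of the stated steps, not the authors' code: `is_schedulable` and `capacity` are caller-supplied stand-ins for the paper's schedulability test and capacity formula, and the toy versions in the usage lines are invented purely to exercise the search. The sketch relies on the same two monotonicity assumptions the summary attributes to Theorem 4.3.

```python
from collections import deque
from typing import Callable, Optional, Sequence, Tuple

QVec = Tuple[int, ...]

def best_quantization(max_levels: Sequence[int],
                      min_level: int,
                      is_schedulable: Callable[[QVec], bool],
                      capacity: Callable[[QVec], float]
                      ) -> Tuple[Optional[QVec], float]:
    """Breadth-first search downward from the maximal quantization vector.

    Assumes capacity is non-decreasing in every component and that lowering
    a component never makes a schedulable vector unschedulable.
    """
    start: QVec = tuple(max_levels)
    queue, seen = deque([start]), {start}
    best, best_cap = None, float("-inf")
    while queue:
        q = queue.popleft()
        if is_schedulable(q):
            cap = capacity(q)
            if cap > best_cap:
                best, best_cap = q, cap
            continue  # anything below q cannot have higher capacity
        for i, level in enumerate(q):
            if level > min_level:
                neighbor = q[:i] + (level - 1,) + q[i + 1:]
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return best, best_cap

# Toy stand-ins (hypothetical): a total-bits budget and a weighted capacity.
toy_schedulable = lambda q: sum(q) <= 12
toy_capacity = lambda q: 3 * q[0] + q[1]
result = best_quantization((8, 8), 1, toy_schedulable, toy_capacity)
```

Because schedulable vectors are not expanded further, the search only visits the unschedulable region and its schedulable boundary rather than the full grid of quantization vectors.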

The practical consequence is straightforward: “BaseRT” most prominently denotes the native Metal LLM runtime introduced in 2026 (Rathnayaka et al., 1 Jul 2026), but the same label is also used in summaries of a 2017 real-time baseband transport problem and its DISTRO solution (Garikipati et al., 2017). Any citation to “BaseRT” therefore benefits from explicit disambiguation by paper title or arXiv identifier.

References

Rathnayaka et al. (2026). BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal.

Garikipati and Shin (2017). Scalable Real-time Transport of Baseband Traffic.
