LlamaWeb: Browser-Based LLM Inference
- LlamaWeb is a browser-based system that uses WebGPU for efficient, multi-precision large language model inference with advanced quantization support.
- It employs a static memory planning strategy with a preallocated GPU arena, reducing peak memory usage by 29–33% across diverse hardware.
- Its tunable kernel library achieves up to 41% speedup through dynamic WGSL shader specialization, while its fine-tuning pipeline integrates supervised and reinforcement learning for optimal agent performance.
LlamaWeb is a system for memory-efficient, performance-portable, multi-precision LLM inference in web browsers via WebGPU. It is implemented as a backend for llama.cpp, supporting a wide range of quantization schemes and providing robust performance across diverse hardware and browser configurations. LlamaWeb is also associated with an end-to-end pipeline for fine-tuning web agents through supervised and reinforcement learning, leveraging statistical best practices for compute allocation and hyperparameter optimization (Levine et al., 20 May 2026, Vattikonda et al., 5 Jul 2025).
1. System Architecture and Components
LlamaWeb is a distinct WebGPU backend module integrated within llama.cpp. On initialization, the llama.cpp core loader ingests a GGUF model file and constructs a directed acyclic graph (DAG) of tensor operators, including matrix multiplications, attention, and elementwise operations. This operator DAG can then be dispatched to any enabled backend; LlamaWeb is the backend designed specifically for browser-based inference.
LlamaWeb is composed of four primary modules:
- Model Loader: Employs llama.cpp’s asynchronous file-loading API to stream GGUF model weights from the browser’s Origin Private File System (OPFS) directly into GPU buffers. It maintains only four 1 MiB staging buffers in the WASM heap, preventing full-model materialization in CPU memory.
- Static Memory Planner: Performs an upfront traversal of the operator DAG to compute the byte requirements for weight tensors (floating-point or quantized), intermediate activation buffers (with support for a configurable sequence length), the KV-cache (for keys and values), and scratch space for fused kernels such as FlashAttention and sampling. A single preallocated GPU buffer ("arena") is partitioned into fixed-offset slices for each tensor, with no further allocations during inference.
- Tunable Kernel Library: This layer comprises a C++ metaprogramming system that generates WGSL shader code via the pre-WGSL preprocessor. At runtime, it specializes operators based on data type (f32, f16, q4, q8, K-quants, I-quants, etc.), workgroup size, tiling, subgroup usage, and memory layout, with each specialization producing a unique WebGPU pipeline cached for reuse.
- Execution Engine: The operator list is scheduled into WebGPU command encoders and compute passes, with batched execution to reduce CPU-GPU synchronization. A statically allocated "parameter buffer" is used for kernel arguments, eliminating per-kernel uniform buffer allocations.
The architecture achieves a separation between model logic (DAG) and hardware/algorithmic specifics ("kernel tuning"), facilitating modular development and future extensions (Levine et al., 20 May 2026).
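The pipeline caching described for the Tunable Kernel Library can be pictured as a dictionary keyed by the full specialization tuple. The following is a minimal sketch, not LlamaWeb's actual code: `KernelKey`, `PipelineCache`, and `fake_compile` are hypothetical names, and the real backend would compile a WGSL shader into a WebGPU compute pipeline where this sketch returns a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class KernelKey:
    """Hypothetical specialization key; the field names are illustrative."""
    op: str                 # e.g. "mul_mat"
    dtype: str              # e.g. "f16", "q4_0"
    workgroup_size: int
    tile: Tuple[int, int]   # register-tile dimensions
    use_subgroups: bool


class PipelineCache:
    """Compile each distinct specialization once and reuse it afterwards."""

    def __init__(self, compile_fn: Callable[[KernelKey], object]):
        self._compile = compile_fn
        self._pipelines: Dict[KernelKey, object] = {}
        self.compilations = 0  # counts cache misses

    def get(self, key: KernelKey) -> object:
        pipeline = self._pipelines.get(key)
        if pipeline is None:
            # Cache miss: pay the compilation cost exactly once per key.
            pipeline = self._compile(key)
            self._pipelines[key] = pipeline
            self.compilations += 1
        return pipeline


def fake_compile(key: KernelKey) -> str:
    """Stand-in for shader generation and pipeline creation (assumption)."""
    return f"pipeline<{key.op},{key.dtype},wg={key.workgroup_size}>"
```

Because the key is a frozen dataclass, two requests with identical parameters hash to the same entry, so only the first request for a given specialization triggers compilation.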
2. Static Memory Planning
LlamaWeb addresses strict per-tab GPU memory budgets imposed by browsers by using an upfront static planning strategy. The planner calculates required memory based on the layer-wise size of:
- Weight tensors, sized from the bits per weight of each tensor's quantization format.
- Activation buffers, sized for a given sequence length.
- KV-cache, sized for keys and values.
- Constant scratch space for attention/sampling.
Summing these four terms yields the total size of the arena buffer.
All tensors are assigned to fixed regions within the single arena buffer, and weights are streamed directly into their slices during model loading. This contrasts with on-the-fly allocation approaches (e.g., in competing libraries), which incur fragmentation (10–20% memory overhead) and peak allocation spikes and risk browser-imposed termination. Empirical results demonstrate 29–33% lower peak memory usage compared to WebLLM and Transformers.js across multiple device/browser combinations (Levine et al., 20 May 2026).
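The planning step can be illustrated with a small offset calculator. This is a sketch under stated assumptions rather than LlamaWeb's planner: the function names are hypothetical, the sizing formulas (parameter count times bits per weight for weights; two tensors per layer for the KV-cache) are the conventional ones and are not taken from the source, and the 256-byte alignment is assumed from WebGPU's default storage-buffer offset alignment.

```python
import math
from typing import Dict, List, Tuple

ALIGNMENT = 256  # assumed slice alignment in bytes


def align_up(n: int, alignment: int = ALIGNMENT) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def weight_bytes(num_params: int, bits_per_weight: float) -> int:
    """Conventional estimate: parameters x bits per weight, in bytes."""
    return math.ceil(num_params * bits_per_weight / 8)


def kv_cache_bytes(n_layers: int, seq_len: int, kv_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Conventional estimate: one K and one V tensor per layer."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_elem


def plan_arena(requests: List[Tuple[str, int]]) -> Tuple[Dict[str, int], int]:
    """Assign each named tensor a fixed, aligned offset in one arena.

    Returns (offsets, total_bytes). Nothing is allocated here; the caller
    would create a single GPU buffer of total_bytes and bind slices of it.
    """
    offsets: Dict[str, int] = {}
    cursor = 0
    for name, size in requests:
        offsets[name] = cursor
        cursor = align_up(cursor + size)
    return offsets, cursor
```

For example, `plan_arena([("weights", 1000), ("activations", 300), ("kv", 256)])` places the three slices at offsets 0, 1024, and 1536 in a 1792-byte arena. Since every offset is known before inference starts, no allocation happens on the hot path.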
3. Tunable Kernel Library and Quantization Support
LlamaWeb’s kernel library addresses hardware performance variability by exposing and tuning kernel parameters such as workgroup sizes, register-tile dimensions, and subgroup configurations. An offline sweep is performed across thousands of (WGX, WGY, Rx, Ry) parameterizations on representative GPUs (NVIDIA, AMD, Intel, Apple M-series), selecting defaults that maximize geometric-mean throughput while bounding worst-case slowdown. This approach delivers a 41% average kernel speedup over manual tuning.
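One way to read the selection rule above is as a constrained maximization over measured throughputs. The sketch below is an assumption about how such a rule could be implemented, not the authors' tuner: `pick_default` and the `max_slowdown` parameter are hypothetical, and throughput is normalized per GPU against the best configuration found for that GPU.

```python
import math
from typing import Dict, Hashable, Optional, Tuple


def pick_default(
    throughput: Dict[Hashable, Dict[str, float]],
    max_slowdown: float,
) -> Tuple[Optional[Hashable], float]:
    """Pick the config with the best geometric-mean relative throughput.

    throughput[config][gpu] is a positive measurement (higher is better).
    Configs that are more than max_slowdown times slower than the per-GPU
    best on any GPU are rejected, which bounds the worst case.
    """
    gpus = sorted(next(iter(throughput.values())))
    best_per_gpu = {g: max(t[g] for t in throughput.values()) for g in gpus}

    best_cfg, best_score = None, -math.inf
    for cfg, measured in throughput.items():
        rel = [measured[g] / best_per_gpu[g] for g in gpus]
        if min(rel) < 1.0 / max_slowdown:
            continue  # violates the worst-case slowdown bound
        # Geometric mean of relative throughput across GPUs.
        score = math.exp(sum(math.log(r) for r in rel) / len(rel))
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The geometric mean keeps one very fast GPU from dominating the score, while the slowdown bound rejects configurations that are excellent on average but poor on a single device.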
A critical design decision is the template-based approach to GPU kernels. All quantization formats (including q4_0, q8_0, K-quants, I-quants, q1_0, bf16, and mxfp4) are unified as buffers with associated metadata, and dequantization is performed inline in WGSL templates. For q4_0, for example, weights are unpacked per workgroup tile, and the scale/offset values are fetched alongside them so that the computation stays accurate. The kernel logic is invariant across quantization formats; only the unpack/dequantization snippet changes, so new formats can be added with minimal code changes (Levine et al., 20 May 2026).
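To make the unpack/dequantization step concrete, the following sketch dequantizes one q4_0 block on the CPU. It assumes the ggml q4_0 layout (32 weights per block, stored as a little-endian fp16 scale followed by 16 bytes of packed 4-bit values with an implicit offset of 8). LlamaWeb performs the equivalent work inline in WGSL, so this Python version is purely illustrative and `dequantize_q4_0_block` is a hypothetical helper.

```python
import struct
from typing import List

QK4_0 = 32                     # weights per q4_0 block
BLOCK_BYTES = 2 + QK4_0 // 2   # fp16 scale + 16 packed bytes = 18 bytes


def dequantize_q4_0_block(block: bytes) -> List[float]:
    """Expand one 18-byte q4_0 block into 32 floats."""
    if len(block) != BLOCK_BYTES:
        raise ValueError(f"expected {BLOCK_BYTES} bytes, got {len(block)}")

    (scale,) = struct.unpack_from("<e", block, 0)  # little-endian fp16
    packed = block[2:]

    out = [0.0] * QK4_0
    for j, byte in enumerate(packed):
        # Low nibble holds weight j, high nibble holds weight j + 16.
        out[j] = ((byte & 0x0F) - 8) * scale
        out[j + QK4_0 // 2] = ((byte >> 4) - 8) * scale
    return out
```

Swapping in another format would change only this function's body (nibble width, scale layout, offset), which mirrors the point that the surrounding kernel logic stays the same.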
4. Fine-Tuning Pipeline and Statistical Optimization
LlamaWeb also refers to an end-to-end LLM web-agent training pipeline comprising:
- Supervised Fine-Tuning (SFT): A LLaMA 3.1 8B student is initialized via imitation of a LLaMA 3.3 70B teacher, using only successful (state, action) pairs for the SFT dataset. The objective is the standard negative log-likelihood minimization over action tokens.
- On-Policy Reinforcement Learning (GRPO): SFT checkpoints are further improved with Group Relative Policy Optimization (GRPO), maximizing discounted returns in a web MDP setting. Sparse rewards reflect task success. Advantage normalization is performed per goal, and zero-advantage filtering excludes low-signal updates, improving stability.
- Hyperparameter Search: Ten interacting training hyperparameters, including decoding temperature, discount factor γ, batch size, and learning rate, are tuned using random search and bootstrap confidence estimation over 1,370 configurations. This statistical approach identifies robust defaults: PLLM=0.25, γ=0.9, batch=512, lr=2, with zero-advantage filtering, grouped-advantage, and std-norm advantage all enabled. Curriculum learning and importance ratio are disabled in SFT-warm starts (Vattikonda et al., 5 Jul 2025).
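The per-goal advantage normalization and zero-advantage filtering described above can be sketched as follows. This is an illustrative reading, not the authors' training code: `group_advantages` is a hypothetical helper, rewards are assumed to be one scalar per rollout, and the population standard deviation with a small epsilon is an assumed normalization detail.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def group_advantages(
    rollouts: List[Tuple[str, float]],
    std_norm: bool = True,
    drop_zero: bool = True,
    eps: float = 1e-8,
) -> List[Tuple[int, float]]:
    """Compute group-relative advantages for (goal, reward) rollouts.

    Returns (rollout_index, advantage) pairs sorted by index. Rollouts
    whose advantage is exactly zero carry no learning signal and are
    dropped when drop_zero is set.
    """
    by_goal: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for idx, (goal, reward) in enumerate(rollouts):
        by_goal[goal].append((idx, reward))

    kept: List[Tuple[int, float]] = []
    for items in by_goal.values():
        rewards = [r for _, r in items]
        mu = mean(rewards)
        sigma = pstdev(rewards)
        for idx, r in items:
            adv = r - mu                  # baseline is the group mean
            if std_norm:
                adv /= sigma + eps        # std-normalized advantage
            if drop_zero and adv == 0.0:
                continue                  # zero-advantage filtering
            kept.append((idx, adv))
    return sorted(kept)
```

With sparse success rewards, a goal on which every rollout succeeds or every rollout fails yields all-zero advantages, so filtering removes exactly the groups that would contribute no gradient.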
Empirically, the best results come from spending 20–30% of the compute budget on SFT and the remaining 70–80% on RL. Robust conclusions also depend on bootstrapped hyperparameter estimation, which quantifies the uncertainty of each setting.
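Bootstrapped estimation of this kind can be sketched with a percentile bootstrap over per-run success rates. The source does not spell out the exact procedure, so the resampling scheme, the `bootstrap_ci` name, and its parameters are assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap confidence interval for the mean of values.

    Returns (point_estimate, lower, upper). values could be, for example,
    the success rates of all sampled runs sharing one hyperparameter value.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(values), lower, upper
```

Comparing the intervals for two settings of a hyperparameter, rather than only their point estimates, indicates whether an apparent improvement survives the run-to-run variance of random search.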
5. Empirical Performance and Compute-Performance Trade-Offs
LlamaWeb demonstrates strong empirical performance in both inference and end-to-end agent training tasks.
On inference:
- LlamaWeb reduces peak GPU memory usage by 29–33% relative to WebLLM and Transformers.js.
- Decode throughput is 45–69% higher than these competing browser-based frameworks on four discrete GPUs.
- On "high" class discrete GPUs, small models achieve ~3,000 tokens/sec prefill and ~100 tokens/sec decode; mid-class integrated GPUs yield ~800 tokens/sec prefill and ~40 tokens/sec decode; mobile devices manage 4–17 tokens/sec decode.
- Compared to native backends (CUDA, SYCL, Metal), LlamaWeb on native Dawn/Vulkan is within 3 and can outperform on certain hardware (e.g., Intel Arc).
- Moving from f16 to q8_0 yields 20–50% decode speedups; efficiency gains stagnate or reverse below 4-bit quantization due to hardware and kernel limitations (Levine et al., 20 May 2026).
On agent training:
| Model | WorkArena (goals/tasks) | MiniWoB++ (goals/tasks) |
|---|---|---|
| LLaMA-3.1-8B SFT | 28.4±2.3% / 26.4±2.2% | 53.4±2.5% / 55.6±2.5% |
| LLaMA-3.1-8B RL | 0.0% | 43.5±2.5% / 43.5±2.5% |
| LLaMA-3.1-8B SFT+RL | 34.6±2.4% / 28.0±2.2% | 66.3±2.4% / 62.9±2.4% |
| LLaMA-3.3-70B | 36.0±2.4% / 44.0±2.5% | 63.2±2.4% / 61.9±2.4% |
| GPT-4o | 42.1±3.2% / 55.7±5.9% | 65.7±2.1% / 64.3±4.5% |
The SFT+RL pipeline achieves nearly GPT-4o-level success on MiniWoB++ and requires only 55% of the compute to match pure SFT performance (efficiency 1.85× higher). For complex tasks such as WorkArena, the compute-performance frontier is improved, but success rates saturate around 40%, which is indicative of teacher limitations and reward sparsity (Vattikonda et al., 5 Jul 2025).
6. Design Insights and Practical Lessons
LlamaWeb surfaces several key engineering and methodological insights:
- Static memory planning is essential for avoiding fragmentation, leakage, and browser capping in resource-constrained environments.
- Separation of kernel tuning and engine logic creates a modular pathway to performance portability and incremental hardware adaptation.
- Unified quantization/dequantization logic supports a multiplicity of formats with minimal additional code, leveraging templated WGSL kernels.
- Runtime kernel compilation of 1–5 seconds (during the first forward pass) is an acceptable cost compared to the performance benefit, with subsequent pipeline use benefiting from cache hits.
- For LLM web agents, bootstrapped hyperparameter estimation and rigorous logging are vital for reproducibility and robust deployment.
Promising research directions include deeper kernel fusion, richer WebGPU features (u16 datatypes, push constants), and auto-tuner or learning-based kernel selection methods to further narrow the performance gap with native APIs. A plausible implication is that such strategies could generalize beyond browsers to other functionally portable ML environments, subject to similar resource and heterogeneity constraints (Levine et al., 20 May 2026, Vattikonda et al., 5 Jul 2025).