Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Published 20 May 2026 in cs.DC, cs.AI, and cs.LG | (2605.20706v1)

Abstract: Running LLMs in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 LLMs and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

Authors (8)

Summary

The paper introduces LlamaWeb, achieving memory-efficient LLM inference through static memory planning and zero-copy buffer management.
It implements performance-portable kernels using empirical device tuning and cached pipeline compilation to optimize matrix operations across diverse hardware.
The work supports 21 quantization schemes via kernel co-design, enabling accurate, multi-precision inference in browser environments.

LlamaWeb: Portable, Memory-Efficient, and Multi-Precision LLM Inference via WebGPU

Introduction

The proliferation of consumer-grade and edge devices necessitates efficient, private, and locally executable LLMs. Browser-based inference offers an accessible, cross-platform avenue for deploying LLMs without device-specific packaging or user installation friction. However, the constraints of browser execution, including memory limitations, device heterogeneity, and the need for robust quantization support, pose significant systems and performance challenges. “Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU” (2605.20706) addresses these constraints by introducing LlamaWeb, an open-source WebGPU backend for the widely adopted llama.cpp inference engine. The paper details technical innovations in memory management, performance-portable kernel design, and quantization-aware GPU computation.

System Design

Static Memory Planning and Model Loading

LlamaWeb statically allocates all required runtime memory at initialization, including tensor data, intermediate buffers for operations such as FlashAttention/FlashDecoding, and a fixed-size parameter buffer “arena” for runtime kernel dispatch variables. This static allocation eliminates the dynamic buffer fragmentation and unpredictable allocation behavior seen in frameworks like WebLLM and Transformers.js, and is critical for conforming to strict per-tab browser memory ceilings, especially on mobile-class systems (e.g., iOS/Safari). The model loading process is optimized to minimize redundant memory materialization by streaming model weights directly from persistent browser storage (OPFS) to WebGPU buffers, leveraging non-blocking asynchronous interfaces to avoid excess allocations in the WASM heap—a crucial consideration given its unreclaimable, grow-only semantics.
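The idea of planning all memory up front can be illustrated with a small sketch. This is not LlamaWeb's allocator: the region names, the byte sizes, and the 256-byte alignment are assumptions chosen for illustration. It only shows how every buffer can receive a fixed offset before inference starts, so that no allocation happens afterwards.

```python
from dataclasses import dataclass

ALIGNMENT = 256  # assumed offset alignment in bytes; real limits depend on the device


def align_up(n: int, alignment: int = ALIGNMENT) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


@dataclass(frozen=True)
class Region:
    name: str
    offset: int
    size: int


def plan_static_layout(requests: dict[str, int]) -> tuple[list[Region], int]:
    """Give every requested buffer a fixed offset; return the regions and total bytes."""
    regions, cursor = [], 0
    for name, size in requests.items():
        regions.append(Region(name, cursor, size))
        cursor = align_up(cursor + size)  # the next region starts on an aligned boundary
    return regions, cursor


# Hypothetical sizes in bytes, for illustration only.
layout, total_bytes = plan_static_layout({
    "weights": 1_000_000,
    "kv_cache": 300_000,
    "attention_scratch": 50_000,
    "param_arena": 4_096,
})

for region in layout:
    print(f"{region.name:>18} offset={region.offset:>9} size={region.size:>9}")
print("total bytes reserved:", total_bytes)
```

Because the total is known before the first token is produced, a planner of this kind can reject a model that would exceed a memory ceiling at load time rather than failing partway through generation.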

Kernel Library and Specialization Framework

Recognizing the extensive hardware and software variability across WebGPU-enabled devices, LlamaWeb factors the kernel library from runtime orchestration. All compute-intensive operations, including matrix multiplication, normalization, activation functions, rotary position embeddings, and FlashAttention variants, are implemented as templated WGSL shaders. The kernel dispatching infrastructure empirically specializes device-tunable parameters (e.g., tiling granularity, workgroup/vector widths, subgroup usage) according to detected hardware features, providing both performance portability and extensibility. Kernel code generation and pipeline compilation are cached using compile-time configuration keys, allowing amortization of specialization overhead after the first forward pass.
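A minimal sketch of caching compiled pipelines by configuration key follows. The class and function names (`KernelConfig`, `PipelineCache`, `choose_config`, `fake_compile`) and the tile sizes are hypothetical, and the compile step is a stand-in string rather than real WGSL generation. The point is only that identical configurations trigger a single compilation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    """Hypothetical compile-time key; frozen so it is hashable."""
    op: str
    weight_format: str
    tile_m: int
    tile_n: int
    use_subgroups: bool


class PipelineCache:
    """Compile each distinct configuration once, then reuse the result."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._pipelines = {}
        self.compile_count = 0

    def get(self, config: KernelConfig):
        if config not in self._pipelines:
            self._pipelines[config] = self._compile_fn(config)
            self.compile_count += 1
        return self._pipelines[config]


def choose_config(op: str, weight_format: str, has_subgroups: bool) -> KernelConfig:
    """Pick tunable parameters from detected features (placeholder values)."""
    tile = 64 if has_subgroups else 32
    return KernelConfig(op, weight_format, tile, tile, has_subgroups)


def fake_compile(config: KernelConfig) -> str:
    """Stand-in for shader generation plus pipeline creation."""
    return f"pipeline<{config.op}:{config.weight_format}:{config.tile_m}x{config.tile_n}>"


cache = PipelineCache(fake_compile)
first = cache.get(choose_config("mul_mat", "q4_0", has_subgroups=True))
second = cache.get(choose_config("mul_mat", "q4_0", has_subgroups=True))
print(first is second, cache.compile_count)
```

Keying on a frozen dataclass means any change to a tunable parameter or weight format yields a distinct pipeline, while repeated dispatches of the same kernel pay the compilation cost only once.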

Quantization-Aware Computation

Unlike prior browser-based inference solutions that either restrict quantization format diversity or treat kernel-level dequantization simplistically, LlamaWeb generalizes across 21 quantization schemes supported by llama.cpp with a kernel co-design approach. Quantized model weights are treated as flat u32 buffers, with dequantization routines chosen at compile time and fused into the main compute kernels (e.g., for matmul or matrix-vector multiplication). This approach supports legacy (q4_0/q8_0), blockwise (K-quants), vector-quantized (I-quants), and ultra-low-bitwidth (q1_0) schemes, with manual work required only when adding new dequantization routines.
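To make "dequantization fused into the compute kernel" concrete, here is a simplified sketch. It does not reproduce any actual GGUF block layout: it assumes a toy format in which each 32-bit word holds eight 4-bit codes, each word has its own scale, and a weight is decoded as (code - 8) * scale. The helper names are hypothetical.

```python
def pack_nibbles(codes: list[int]) -> int:
    """Pack eight 4-bit codes (0..15) into one unsigned 32-bit word, lowest nibble first."""
    assert len(codes) == 8 and all(0 <= c <= 15 for c in codes)
    word = 0
    for i, code in enumerate(codes):
        word |= code << (4 * i)
    return word


def dequant_word(word: int, scale: float) -> list[float]:
    """Unpack one 32-bit word into eight floats using (code - 8) * scale."""
    return [(((word >> (4 * i)) & 0xF) - 8) * scale for i in range(8)]


def fused_dot(words: list[int], scales: list[float], x: list[float]) -> float:
    """Dot product of one quantized row with x, decoding weights inside the loop.

    The full-precision row is never materialized, which is the point of fusing
    dequantization into the matrix-vector kernel.
    """
    acc = 0.0
    for block, (word, scale) in enumerate(zip(words, scales)):
        for i in range(8):
            code = (word >> (4 * i)) & 0xF
            acc += (code - 8) * scale * x[8 * block + i]
    return acc


# Toy data, for illustration only.
words = [pack_nibbles([8, 9, 7, 10, 6, 8, 8, 12]), pack_nibbles([0, 15, 8, 8, 8, 8, 8, 8])]
scales = [0.5, 0.25]
x = [1.0] * 16

reference = sum(w * xi for w, xi in zip(dequant_word(words[0], scales[0])
                                        + dequant_word(words[1], scales[1]), x))
print(fused_dot(words, scales, x), reference)
```

In a templated kernel, the shift-and-mask decode step is the part that would be swapped per format, while the accumulation loop stays the same.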

Experimental Results

Benchmark Suite and Device Coverage

The LlamaWeb evaluation utilizes 10 LLMs (spanning transformers, SSMs, and 1-bit architectures) and 16 devices across eight vendors (NVIDIA, AMD, Intel, Apple, ARM, Qualcomm, Samsung, Imagination Tech), encompassing both desktop and resource-constrained mobile hardware. Cross-comparisons are conducted against existing browser LLM frameworks (WebLLM, Transformers.js), browser-native backends, and system-native backends (CUDA, Vulkan, Metal, HIP, SYCL).

Memory Consumption

LlamaWeb achieves peak memory usage reductions of 29–33% over WebLLM and Transformers.js (geometric mean across device-browser-OS tuples), attributable to static allocation and zero-copy buffer management. On Safari/iOS (the most constrained class), LlamaWeb enables inference with state-of-the-art models that crash or run unsatisfactorily in baseline engines. Memory leaks were observed in both competing frameworks under certain configurations, but not in LlamaWeb.

Throughput and Performance-Portability

LlamaWeb’s median decode throughput, measured in tokens per second (tok/s), exceeds WebLLM by 45% and Transformers.js by 69% across four major GPUs. On certain device/backend combinations, particularly on Intel GPUs, LlamaWeb with the WebGPU backend surpasses even native system libraries. The portable kernel tuning approach enables this performance parity despite browser-imposed runtime safety checks, with only a moderate (~14–23%) slowdown relative to unchecked native execution.

Performance portability was quantitatively validated via k-means device clustering; LlamaWeb adapts to high-, mid-, and low-tier GPUs, delivering substantial absolute throughput (e.g., 3k tok/s prefill and 100 tok/s decode for smaller models on RTX 5080; 4–17 tok/s decode on mobile GPUs). Notably, in the prefill phase, LlamaWeb is outperformed by WebLLM and Transformers.js because those frameworks apply kernel fusion and subgroup-matrix optimizations; LlamaWeb’s portable kernels do not currently exploit subgroup-matrix operations in the browser, but they achieve competitive results when these are enabled natively.

Quantization Format Scaling

Templated dequantization kernels enable near-optimal decode throughput for high-utility quantization formats (e.g., q4_k_m, q8_0). For low-bitwidth quantization (q2_k), the performance advantage diminishes and can even reverse due to increased dequantization overhead relative to the saved memory bandwidth; decode throughput increases 20–53% when moving from f16 to q8_0, but regresses by ~17% for q2_k on mid-tier GPUs. Prefill performance is relatively invariant to weight format. LlamaWeb’s quantization-aware design delivers uniform and predictable scaling as KV-cache grows, validating its robustness for long-context inference.
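The trade-off between saved memory bandwidth and added dequantization work can be sketched with a crude cost model. Everything numeric here is hypothetical: the device bandwidth, compute rate, bits per weight, and operations per weight are made-up values chosen to show the shape of the effect, not measurements from the paper.

```python
def decode_time_per_token(n_weights: float, bits_per_weight: float, ops_per_weight: float,
                          bandwidth_bytes_s: float, compute_ops_s: float) -> float:
    """Crude bound: one token costs the larger of weight-streaming time and compute time."""
    memory_time = n_weights * bits_per_weight / 8 / bandwidth_bytes_s
    compute_time = n_weights * ops_per_weight / compute_ops_s
    return max(memory_time, compute_time)


# Hypothetical device: 100 GB/s memory bandwidth, 2e12 simple ops per second.
N_WEIGHTS = 1_000_000_000
BANDWIDTH = 100e9
COMPUTE = 2e12

# (bits per weight, ops per weight including decode work); all values are assumptions.
formats = {
    "16-bit float": (16.0, 2),
    "8-bit blocks": (8.5, 4),
    "2-bit blocks": (2.6, 24),
}

times = {
    name: decode_time_per_token(N_WEIGHTS, bits, ops, BANDWIDTH, COMPUTE)
    for name, (bits, ops) in formats.items()
}
for name, seconds in times.items():
    print(f"{name:>13}: {seconds * 1e3:.2f} ms/token")
```

Under these assumed numbers the 8-bit format is faster than 16-bit because it streams fewer bytes, while the 2-bit format becomes compute-bound on decoding and ends up slower than the 8-bit one, which mirrors the qualitative pattern described above.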

Practical and Theoretical Implications

Deployment and Ecosystem Impact

LlamaWeb extends the GGUF/open-weight LLM ecosystem to the browser in a memory- and performance-efficient manner. This enables accessible, private, and portable deployment of the substantial llama.cpp model zoo—including recent agentic and multilingual architectures—on any device with a modern browser. The browser’s ubiquity removes installation and configuration barriers, democratizing access to LLM technology and expanding privacy-respecting computation beyond server-centric architectures.

Theoretical Directions

The successes and bottlenecks observed suggest multiple avenues for future research:

Dynamic Kernel Fusion: Integrating kernel fusion techniques, as done in TVM, to minimize kernel launch/dispatch overhead can yield substantial further improvements, especially in the prefill phase.
Auto-tuning and Meta-learning for Performance Portability: Incorporation of meta-parameter optimization, drawing from auto-tuning frameworks, can refine device-specific performance further.
Deeper Co-Design of Quantization and Hardware: The limits of portable quantization-aware kernels are set by the dequantization compute/memory tradeoff; deeper integration with subgroup and tensor-core hardware or evolving the quantization formats themselves for hardware-awareness is a critical area for both LLM and systems research.
WebGPU Specification and Implementation: Addressing runtime safety check overheads, specialization for sub-32-bit integer types (u16/u8), and improved floating-point semantics will be important for the maturation of browser compute APIs as viable LLM deployment targets.
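The auto-tuning direction listed above can be sketched as an exhaustive sweep over a small search space. The `sweep` helper, the parameter names, and the synthetic cost function are assumptions for illustration; a real tuner would time actual kernel dispatches instead of calling `fake_benchmark`.

```python
import itertools


def sweep(benchmark, search_space: dict[str, list[int]], repeats: int = 3):
    """Try every combination and return the config with the lowest best-of-N time."""
    best_config, best_time = None, float("inf")
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        # Taking the minimum over repeats reduces the influence of timing noise.
        elapsed = min(benchmark(config) for _ in range(repeats))
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


def fake_benchmark(config: dict[str, int]) -> float:
    """Hypothetical cost surface standing in for a timed kernel dispatch."""
    return 1.0 + abs(config["tile"] - 32) / 32 + abs(config["workgroup_size"] - 128) / 128


best_config, best_time = sweep(
    fake_benchmark,
    {"tile": [8, 16, 32, 64], "workgroup_size": [64, 128, 256]},
)
print(best_config, best_time)
```

The result of such a sweep could be persisted per device so that later sessions skip the search, which is one way the first-run cost might be amortized.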

Conclusion

“Llamas on the Web” establishes LlamaWeb as a reference architecture for browser-based LLM inference, demonstrating that aggressive systems-level design—static memory planning, kernel specialization, and quantization-aware execution—enables significant gains in both efficiency and portability without sacrificing numerical correctness or extensibility. LlamaWeb sets a new operational baseline for quantized LLM deployment within browsers, and provides a flexible platform for future research in cross-device LLM optimization and quantized inference systems (2605.20706).

Explain it Like I'm 14

What this paper is about (big picture)

The paper introduces LlamaWeb, a way to run large language models (LLMs) directly inside your web browser using your computer’s graphics chip (GPU). The goal is to make browser AI apps fast, memory‑friendly, and private (no data has to leave your device), even though browsers run on many different kinds of hardware with tight memory limits.

What the researchers wanted to figure out

In simple terms, they asked:

How can we run useful LLMs in the browser without running out of memory or crashing tabs?
How can we make performance “travel well” so it’s fast on many different GPUs (from phones to laptops to desktop PCs)?
How can we support lots of model “sizes” and compression styles (called quantization) so more models fit and run well?

How they approached it (methods explained simply)

Think of running a model like cooking a big meal in a tiny kitchen:

Memory planning: Instead of grabbing pans and bowls on the fly (which can cause clutter), they lay out all the cookware they’ll need at the start and reuse it. In computer terms, they “statically allocate” GPU memory when the page loads, so the browser doesn’t have to keep asking for more space later.
Smart loading: They stream the model’s “ingredients” (weights) straight from the browser’s storage to the GPU, skipping extra copies. That’s like carrying groceries directly to the fridge instead of piling them on the counter first.
A tunable kernel library: The “recipes” that run on the GPU (called kernels) are written so they can be tuned for different devices. If the GPU has special features, they use them; if not, they fall back to a portable version. This helps the same code run well on many GPUs.
Built‑in support for many quantization formats: Quantization is like compressing numbers so the model takes less space (imagine shrinking files to save storage). They designed their GPU code to “decompress” (dequantize) these numbers on the fly while computing, which keeps things fast and memory‑efficient.
Browser technologies: They implemented this using WebGPU (the browser’s GPU API) and integrated it with llama.cpp (a popular engine that supports many models). They also used WebAssembly and a small WGSL “preprocessor” to generate fast GPU programs and cache them.

They tested LlamaWeb on:

16 devices from 8 different vendors (from phones to desktops).
10 different LLMs.
Multiple model formats (including 16‑bit floats and many quantized versions).
Multiple browsers (mainly Chrome; Safari on iOS).

They measured two phases of running a model:

Prefill: reading the whole prompt (heavy on big matrix multiplications).
Decode: generating tokens one by one (heavy on matrix‑vector multiplications).
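The difference between the two phases can be shown with a little arithmetic. In this sketch (the function name and the layer sizes are made up for illustration), we count how much math is done for every byte of model weights that has to be read. Prefill reuses each weight for every prompt token, while decode uses each weight only once per generated token.

```python
def flops_per_weight_byte(n_tokens: int, d_in: int, d_out: int,
                          bytes_per_weight: float = 2.0) -> float:
    """Math operations performed per byte of weights read for one layer."""
    flops = 2 * n_tokens * d_in * d_out          # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes


# Hypothetical layer of size 4096 x 4096 stored as 16-bit floats.
prefill = flops_per_weight_byte(n_tokens=512, d_in=4096, d_out=4096)
decode = flops_per_weight_byte(n_tokens=1, d_in=4096, d_out=4096)
print(prefill, decode)  # 512.0 1.0
```

With a 512-token prompt, prefill does 512 times more math per weight byte than decode, which is why prefill is usually limited by how fast the GPU can compute and decode by how fast it can read memory.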

What they found (main results and why they matter)

Lower memory use in browsers:
- LlamaWeb used about 29–33% less peak memory than other browser frameworks (like WebLLM and Transformers.js) across several devices and browsers.
- This matters because browser tabs often have strict memory limits; using less memory means fewer crashes and the ability to run larger models.
Faster token generation (decode phase):
- LlamaWeb increased decode speed by about 45–69% on average across four different GPUs compared to existing browser frameworks.
- Faster decode means snappier, more responsive chat or writing experiences.
Competitive with native backends:
- When compared to running models using vendor‑specific native backends (outside the browser), LlamaWeb was often competitive and even faster on some devices. That’s notable because browsers are usually at a disadvantage compared to native apps.
Broad model and format support:
- Because it builds on llama.cpp and supports many quantization formats, it can run a huge number of models (hundreds of thousands available in the llama.cpp GGUF format).
- This flexibility helps developers and users pick models that fit their device’s memory and speed.
One caveat: slower prefill in the browser
- During the initial “prefill” step, LlamaWeb was 21–51% slower than some alternatives in the browser. This is an area they plan to improve.

Why this matters (impact and future directions)

Privacy and portability: Running models locally in your browser means your text doesn’t have to be sent to a server, which can be better for privacy. It also works across many devices without installing extra software.
Efficiency on everyday hardware: By using memory carefully and supporting compression (quantization), more models can run on phones, tablets, and laptops—making on‑device AI more accessible.
A foundation for better web AI: Their tunable kernel library and support for many formats make it easier for others to add new models or compression methods. As browsers add more GPU features, LlamaWeb can get even faster.
What’s next: Improve prefill speed, add smarter auto‑tuning for different devices, and keep expanding support for new quantization techniques and GPU capabilities.

In short, this work shows that powerful AI models can run efficiently and privately inside the browser across many devices—bringing fast, install‑free AI closer to everyone.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following issues unresolved and open for future research:

Prefill performance deficit: LlamaWeb underperforms existing frameworks by 21–51% during prefill; the causes (e.g., matmul tiling, fusion opportunities, memory traffic) and effective remedies remain uncharacterized.
Lack of online auto-tuning: Kernel parameters are set via a one-time empirical study on four GPUs; no per-device, per-browser auto-tuner or low-overhead runtime adaptation is provided or evaluated.
Experimental feature reliance: Subgroup-matrix (tensor-core–like) paths are not available in stable browsers; the performance and portability plan as these features roll out (and robust fallbacks) is not systematically assessed.
GPU-side sampling: Offloading single-token sampling showed no gains; it remains open how to design fused projection+sampling, numerically stable temperature/top-k/top-p kernels, and data-movement–aware pipelines that actually outperform CPU sampling in browsers.
No batching or speculative decoding: The system targets single-sequence, low-latency inference; browser-suitable batching, speculative decoding, or chunked prefill strategies to improve throughput are unexplored.
Long-context behavior: Evaluation covers KV depths 0 and 2048 only; performance, memory, and stability for much longer contexts (e.g., >32k), paged KV-cache designs, and RoPE scaling in WebGPU remain untested.
KV-cache quantization scope: Only a subset of KV-cache formats (e.g., q4_0, q8_0) is supported/evaluated; the quality–throughput trade-offs of more aggressive or mixed-precision KV quantization across devices are not studied.
Numerical stability and precision choices: f16 accumulation caused incoherent outputs on some Apple GPUs, prompting f32 accumulation; a systematic cross-device analysis of accumulation precision, rounding modes, and end-to-end generation quality is missing.
Determinism and reproducibility: The impact of browser/driver differences on numerical determinism and token-level reproducibility is not evaluated.
WGSL type limitations: Emulating formats via u32 buffers (due to absent u8/u16/bf16/nvfp4 types) may inflate bandwidth; the quantitative overhead and migration plan when these types land in WGSL are not analyzed.
Limited hardware-native format support: Only mxfp4 emulation is implemented; broader support (e.g., bf16, nvfp4) and their performance/quality trade-offs in WebGPU remain open.
Quantization–kernel co-design depth: While templated dequantization is integrated, deeper co-design (e.g., subgroup-friendly packing, codebook layouts for I-quants, per-arch bit-layouts) and its measured benefits are not explored.
Data layout documentation: The paper notes a lack of formal documentation for several quant formats; creating, validating, and standardizing layouts optimized for WebGPU alignment/vectorization is an open task.
Compile-time overhead and TTFT: Kernel compilation adds ~1–5 s on the first forward pass; its impact on time-to-first-token, pipeline caching across sessions (e.g., OPFS-backed), and cross-device variance is not measured.
Queueing and pass scheduling: Heuristics group operations into compute passes; optimal queue submission granularity, in-flight command depth, and cross-implementation tuning are not studied.
Static memory planning limits: Fixed arenas and pre-allocated intermediates improve stability but may constrain very long sessions, concurrent models, or dynamic workloads; strategies for safe resizing, eviction, or backpressure are not provided.
Parameter-buffer semantics: The slot-rotation guarantee (“not overwritten until finished”) lacks a formal analysis under deep pipelining, multiple workers, or concurrent graphs; detection and recovery on misuse/device loss are unspecified.
Browser and backend breadth: Firefox is omitted due to poor performance; concrete root-cause analysis and co-development with wgpu to close the gap—and broader OS/driver matrix coverage—are absent.
Mobile constraints and thermals: iOS/Safari hard caps limited model sizes; sustained performance under thermal throttling, battery impact, and mobile power-per-token measurements are not reported.
Energy efficiency: Despite motivation around local energy benefits, no power or energy-per-token measurements are provided across devices.
Model coverage and robustness: Only 10 models are evaluated out of >170k GGUF-compatible models; large-scale compatibility, long-tail operator coverage, and soak testing (e.g., hours-long sessions) are not presented.
SSM/convolution support evaluation: Kernels exist for SSMs/convolutions, but their performance characteristics, tuning needs, and comparisons to transformer workloads are not benchmarked.
MoE and dynamic shapes: Mixture-of-experts routing, expert parallelism, and dynamic-shape kernels (and their scheduling/memory implications) are not addressed.
Error handling and resilience: Behavior under memory pressure, tab eviction, GPU device loss, or browser-initiated preemption—and strategies for graceful degradation or recovery—are unspecified.
Security and privacy of OPFS models: Model caching lacks discussion of encryption at rest, integrity checks, provenance, and defenses against supply-chain or cross-origin abuse.
Guidance for backend selection: While WebGPU sometimes matches or beats native backends, a diagnostic framework to predict when WebGPU will underperform (and why) is not provided.

Practical Applications

Overview

LlamaWeb delivers a WebGPU backend for llama.cpp that enables memory-efficient, performance-portable, multi-precision LLM inference directly in the browser and native WebGPU environments. Its static memory planning, templated/tunable kernels, and broad quantization support lower peak memory by 29–33% and increase decode throughput by 45–69% over existing browser frameworks, while unlocking 23 weight formats and 170k+ GGUF models. Below are actionable applications that leverage these results across industry, academia, policy, and daily life.

Immediate Applications

The following applications can be deployed now using LlamaWeb (in-browser) or the same backend natively via Dawn; they assume current WebGPU availability on Chrome (and Safari on iOS), and models sized to device memory.

Private, on-device chat and writing assistants in the browser — (software, education, enterprise)
- What: Embed a “no-cloud” assistant in web apps, intranets, LMS portals, and Electron/PWA shells for drafting, summarizing, and code help with data staying on-device.
- Tools/products/workflows: Web SDK wrapping wllama + LlamaWeb; OPFS-backed model manager; per-device model selection (e.g., q4_k_m 1–3B); enterprise Electron app using Dawn for native WebGPU.
- Assumptions/dependencies: WebGPU enabled; enough VRAM/system memory (e.g., 0.3–3 GB model files); model download time (served via CDN/OPFS); prompt lengths where decode dominates (prefill is currently slower than some alternatives); Firefox performance not yet competitive.
Customer support and knowledge widgets without server round-trips — (software, finance, retail, public sector)
- What: Site-embedded FAQ/chat agents that process internal content locally for privacy and cost savings.
- Tools/products/workflows: Widget that streams GGUF models into OPFS; client-side RAG via local embeddings and KV-cache; per-session long-context via KV quantization to stretch local context.
- Assumptions/dependencies: Content pre-indexed to local storage; careful prompt engineering to manage prefill cost; regulatory review where needed.
Classroom and Chromebook-ready tutoring tools — (education)
- What: School-safe assistants that run entirely in-browser on managed devices, with no data egress.
- Tools/products/workflows: PWA tutors (math, reading, language) with offline mode; model policies curated by grade; device-aware quantized model selection (q2_k/q4_k_m/q1_0 on low-memory devices).
- Assumptions/dependencies: IT whitelists WebGPU; model sizes compatible with school hardware; adherence to FERPA/COPPA.
Field service and offline productivity apps — (energy, manufacturing, logistics)
- What: On-site assistants on rugged laptops/tablets for checklists, procedure guidance, and report drafting without connectivity.
- Tools/products/workflows: PWA with OPFS model cache; job pack download at depot, inference in the field; quantized models for battery efficiency.
- Assumptions/dependencies: Initial model download bandwidth; sufficient device memory; Chrome preferred for performance.
Healthcare note drafting and patient-education aids (local only) — (healthcare)
- What: Browser-based note suggestions and patient-instruction generation on hospital devices to reduce documentation time while keeping PHI local.
- Tools/products/workflows: Hospital intranet app using LlamaWeb; audit logs confirming no network egress; model catalogs restricted to vetted medical-tuned small models.
- Assumptions/dependencies: Institutional review and HIPAA compliance; managed devices; legal sign-off; limited to inference (no training).
Secure government/enterprise intranet assistants — (public sector, defense, enterprise IT)
- What: Privacy-preserving LLM utilities on air-gapped or restricted networks without cloud dependency.
- Tools/products/workflows: On-prem static site hosting GGUF models via internal CDN; Electron desktop assistant using native WebGPU (Dawn).
- Assumptions/dependencies: Device GPU support; model governance and software supply-chain vetting.
In-browser research harness for cross-device LLM benchmarking — (academia, systems research)
- What: Reproducible, large-scale measurement across browsers, OSes, and GPUs, including mobile.
- Tools/products/workflows: Use the paper’s harness to test new kernels/quant formats, record perf/memory across 16+ device types; publish datasets for systems papers.
- Assumptions/dependencies: Unified harness + CI farm with device variety; stable Chrome WebGPU; IRB not required (no user data).
Rapid prototyping of new quantization formats on WebGPU — (ML research)
- What: Plug-in dequantization routines via LlamaWeb’s templated kernels to evaluate accuracy–performance tradeoffs.
- Tools/products/workflows: Extend kernel templates with new block layouts/codecs; evaluate prefill/decode on representative devices; iterate without kernel rewrites.
- Assumptions/dependencies: WGSL limitations (no u16/u8 types yet); format-specific packing in u32 buffers; correctness checks via llama.cpp tests.
Browser-integrated accessibility helpers — (public interest, daily life)
- What: Local summarization, rephrasing, and reading simplification in assistive extensions without sending sensitive browsing content to servers.
- Tools/products/workflows: Extension content scripts calling a PWA LLM service using OPFS; model selection by device capability.
- Assumptions/dependencies: Extension store policies; memory footprint acceptable on user devices.
Edge kiosks and robots with web UIs — (robotics, retail, transportation)
- What: On-device voice/chat interfaces on kiosks/robots that already host Chromium/WebKit UIs.
- Tools/products/workflows: Local LLM powering task dialog and status explanations, integrated with ROS/edge apps through web messaging.
- Assumptions/dependencies: Real-time speech pipeline handled separately; memory budgets; long prompts tuned to avoid prefill bottlenecks.
Developer tooling: pre-wgsl and kernel-tuning starter kits — (software tooling)
- What: Use pre-wgsl to manage WGSL variants and launch kernel sweeps to pick performance-portable defaults per device class.
- Tools/products/workflows: CI jobs that autotest tile sizes/workgroup configs across a device matrix; publish per-segment defaults.
- Assumptions/dependencies: Access to device lab; Chrome/Dawn feature parity; subgroup features availability varies.

Long-Term Applications

The following require further research, browser features, broader hardware support, or engineering scale-up.

Near-native browser inference via stable subgroup matrix ops — (software, all sectors)
- What: Once subgroup matrix/tensor-core features stabilize across browsers, approach native CUDA/Metal performance for prefill and larger models.
- Tools/products/workflows: Automatic path selection to tensor-core variants (sg_mat, subgroup attention); fallback to portable paths.
- Assumptions/dependencies: WebGPU subgroup matrix standardization; driver coverage; spec/test maturity; security review in browsers.
Auto-tuning and per-device optimization at runtime — (software tooling, academia)
- What: On-device tuning of workgroup sizes, tiling, and vectorization to maximize throughput and reduce variance across heterogeneous GPUs.
- Tools/products/workflows: First-run microbenchmarks to pick configs; persisted per-device profiles; safe adaptive re-tuning.
- Assumptions/dependencies: Time budget for tuning; user consent; stable perf counters/telemetry APIs.
Richer low-bit and hardware-native formats in-browser — (ML systems, model serving)
- What: Extend to bf16, nvfp4, mxfp4 (beyond emulation) and u16/u8 types as WGSL evolves; unlock larger models and better perf/W.
- Tools/products/workflows: Hybrid pipelines mixing native low-bit intrinsics and dequant templates; format-aware KV-cache compression.
- Assumptions/dependencies: WGSL type additions; browser and driver support; conformance tests.
Client-side fine-tuning and adapters (LoRA/QLoRA-lite) — (education, enterprise, research)
- What: Lightweight personalization and domain adapters trained or merged in-browser for private customization.
- Tools/products/workflows: LoRA merging pipelines in WASM/WebGPU; OPFS adapter catalogs per user/team.
- Assumptions/dependencies: Memory/time budgets; training stability at browser precision; UX around battery/thermal limits.
Fleet-scale edge deployments and MDM integration — (public sector, healthcare, enterprise IT)
- What: Manage model catalogs, updates, and compliance on thousands of endpoints (kiosks/laptops) with device-aware model targeting.
- Tools/products/workflows: MDM policies distributing GGUF models; telemetry on-device only; rules to enforce no network egress.
- Assumptions/dependencies: Organizational IT maturity; content delivery at scale; legal/privacy frameworks.
Standardized privacy/compliance profiles for local LLMs — (policy, governance)
- What: Procurement and regulatory frameworks that endorse local, browser-based inference for sensitive data processing.
- Tools/products/workflows: “No data leaves device” attestations; audit logs; repeatable tests ensuring network isolation.
- Assumptions/dependencies: Regulator acceptance; standardized test suites; independent certification programs.
Energy-aware inference and reporting — (energy, sustainability)
- What: Device-level energy/battery-aware scheduling, quantization selection, and reporting to reduce emissions vs. cloud inference.
- Tools/products/workflows: Power telemetry APIs; adaptive precision selection; sustainability dashboards.
- Assumptions/dependencies: Browser energy APIs; standardized methods to attribute energy; user consent.
Web-based benchmarking consortia and conformance tests — (academia, standards)
- What: Community-maintained test suites for WebGPU LLM inference across devices, ensuring perf/accuracy portability.
- Tools/products/workflows: Open datasets, dashboards, and automated nightly runs across device farms.
- Assumptions/dependencies: Vendor participation; CI infrastructure; evolving spec features.
Trusted and verifiable on-device inference — (security, finance, public sector)
- What: Combine WebGPU with attestation/TEE or verifiable computation to prove local-only, untampered inference.
- Tools/products/workflows: Browser/OS attestation hooks; signed GGUF manifests; reproducibility checks.
- Assumptions/dependencies: OS/browser support; performance overhead; key management.
Browser-based multimodal assistants leveraging the same backend — (healthcare, education, retail)
- What: Extend memory-efficient kernels to vision/audio pre/post-processing for multimodal tasks (e.g., forms parsing, voice agents).
- Tools/products/workflows: WebGPU-friendly CNN/ViT kernels; streaming audio tokenization; on-device ASR/TTS integration.
- Assumptions/dependencies: Additional kernels/operators; increased memory demands; DSP/AudioWorklet pipelines.
Quantization–kernel co-design “codecs” — (ML research)
- What: New quantization layouts co-optimized with subgroup/shared-memory execution for further gains in decode and attention.
- Tools/products/workflows: Automated search over bit-packings vs. memory alignment; format generators with correctness harnesses.
- Assumptions/dependencies: WGSL feature evolution; portable vs. device-specific tradeoffs; accuracy validation on downstream tasks.
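As a hedged illustration of the "correctness harness" idea in the quantization-kernel co-design item, the sketch below round-trips weights through a symmetric 8-bit format with one scale per 32-weight block, modeled on the glossary's description of q8_0, and checks the reconstruction error against a threshold. The function names, the normalization used for the error metric, and the threshold value are assumptions made for illustration; this is not the actual llama.cpp or LlamaWeb code or memory layout.

```python
import random

BLOCK = 32  # weights per block (assumption, following the q8_0 glossary entry)


def quantize_blocks(weights):
    """Symmetric 8-bit quantization: one float scale per block of BLOCK weights."""
    assert len(weights) % BLOCK == 0, "length must be a multiple of the block size"
    blocks = []
    for start in range(0, len(weights), BLOCK):
        chunk = weights[start:start + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127.0 if amax > 0 else 0.0
        inv = 1.0 / scale if scale else 0.0
        # Clamp to the signed 8-bit range used by a symmetric format.
        quants = [max(-127, min(127, round(w * inv))) for w in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_blocks(blocks):
    """Reconstruct approximate float weights from (scale, quants) pairs."""
    return [scale * q for scale, quants in blocks for q in quants]


def nmse(reference, candidate):
    """Squared error divided by the reference energy (assumed definition)."""
    err = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    energy = sum(r * r for r in reference)
    return err / energy if energy > 0 else err


def passes_harness(weights, threshold=1e-3):
    """Return True if the quantize/dequantize round trip stays under the threshold."""
    restored = dequantize_blocks(quantize_blocks(weights))
    return nmse(weights, restored) <= threshold


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(256)]
    print("round trip within threshold:", passes_harness(sample))
```

A format generator could run a check of this kind for every candidate bit-packing before any kernel is tuned for it, so that layouts with unacceptable error are rejected early. Accuracy on downstream tasks would still need separate validation, as the item's dependencies note.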

Cross-cutting assumptions and dependencies

Browser support: Best results on Chrome/Dawn today; Safari required on iOS (tight per-tab memory). Firefox performance currently lags.
Feature availability: Subgroup and subgroup-matrix operations not universally available in stable browsers; WGSL lacks u16/u8 types.
Memory constraints: The model must fit in device memory (e.g., a practical limit under 500 MB on iOS); prefill-heavy workloads with large prompts may underperform some other frameworks.
Distribution: Initial model downloads (hundreds of MB to a few GB) require a CDN and resumable downloads cached in OPFS.
Security/compliance: Local inference simplifies privacy but still needs supply-chain vetting and organizational policies.
Workload fit: Decode-dominated interactive chat benefits most immediately; long-context analytics may require optimization or smaller models.
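The memory constraint above, together with the device-aware model targeting mentioned under fleet deployments, can be pictured as a simple selection rule. The sketch below is a minimal, assumption-laden example: the catalog entries, sizes, and the headroom fraction reserved for the KV-cache and activations are all hypothetical, and "largest variant that fits" is used only as a stand-in for a real quality ranking.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelVariant:
    name: str     # hypothetical catalog name
    quant: str    # quantization format label, e.g. "q4_k_m"
    size_mb: int  # size of the model file in MB (hypothetical values below)


def pick_variant(
    catalog: Sequence[ModelVariant],
    budget_mb: float,
    headroom: float = 0.25,
) -> Optional[ModelVariant]:
    """Pick the largest variant that fits after reserving headroom.

    `headroom` is the fraction of the budget kept free for the KV-cache,
    activations, and browser overhead (an assumed value, not a measured one).
    """
    usable = budget_mb * (1.0 - headroom)
    candidates = [v for v in catalog if v.size_mb <= usable]
    return max(candidates, key=lambda v: v.size_mb, default=None)


if __name__ == "__main__":
    catalog = [
        ModelVariant("example-small", "q4_k_m", 300),
        ModelVariant("example-medium", "q4_k_m", 700),
        ModelVariant("example-large", "q8_0", 2000),
    ]
    for budget in (100, 500, 1000, 4000):
        choice = pick_variant(catalog, budget)
        print(budget, "MB ->", choice.name if choice else "no variant fits")
```

In a real deployment the budget would have to come from measurement or per-device policy rather than a constant, since browsers do not report a reliable per-tab memory limit.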

These applications leverage LlamaWeb’s concrete advances—static memory planning, efficient in-browser model loading, performance-portable kernels, and broad quantization support—to make private, portable LLMs practical today and pave the way for faster, richer, and more standardized on-device AI in the browser.

Glossary

AWQ: Activation-aware weight quantization method used to improve post-training quantization quality. "techniques from other post-training quantization methods like GPTQ [13] and AWQ [36]"
bf16: Bfloat16, a 16-bit floating-point format with wider exponent than f16 for better range. "hardware-native formats like bf16, nvp4, and mxfp4."
bind groups: WebGPU resource binding objects that group buffers/textures for pipeline access. "Data is stored in buffers allocated by the host and organized into bind groups which are bound to pipelines according to specified bind group layouts."
command encoders: WebGPU objects that record GPU commands before submission. "compute workloads are recorded via command encoders and compute passes"
compute passes: Sections within a command encoder used to record compute workloads. "compute workloads are recorded via command encoders and compute passes"
CUDA: NVIDIA’s GPU programming platform and API. "including a CUDA backend for NVIDIA GPUs"
DAG: Directed acyclic graph; an operation graph without cycles used for scheduling. "This directed acyclic graph (DAG) is then linearized into a topological ordering"
Dawn: Chromium’s native WebGPU implementation/library. "Dawn [16] is used in Chromium-based browsers"
deferred execution model: Programming model where commands are recorded and executed later upon submission. "WebGPU follows a deferred execution model; compute workloads are recorded via command encoders and compute passes"
double quantization: Technique that quantizes scaling factors themselves to further reduce storage. "they are similar to the double quantization technique introduced in QLoRA [11]."
Emscripten: Toolchain that compiles C/C++ to WebAssembly for web environments. "Tools like Emscripten [68] enable cross-compilation of C++ to WebAssembly (WASM)"
FlashAttention: IO-aware fused attention algorithm that reduces memory traffic by never materializing the full attention matrix in GPU memory. "as well as for FlashAttention and sampling operations like top-k"
FlashDecoding: Attention variant for the decode phase that splits the KV-cache into chunks processed in parallel, then merges the partial results with a softmax rescaling step. "implement FlashDecoding [9]"
GGUF: llama.cpp’s custom binary format for model weights and metadata. "model weights are stored in a custom binary format (GGUF)"
GLU (gated linear units): Neural network layer that uses gating to modulate activations. "important LLM operations like gated linear units (glu)"
HIP: AMD’s GPU programming framework similar to CUDA. "a HIP backend for AMD GPUs"
I-quants: llama.cpp quantization family using codebook indices (vector quantization-inspired). "I-quants: These formats also use blocks of 256 weights, but are inspired by vector quantization"
K-quants: llama.cpp quantization family that uses 256-weight super-blocks split into smaller sub-blocks, with the per-sub-block scaling factors themselves quantized. "K-quants: These formats use blocks of 256 weights"
KV-cache: Cache of key/value tensors storing past tokens for efficient attention during generation. "can be stored in a KV-cache"
MLC-LLM: Compiler-based LLM deployment engine (built on Apache TVM) with a WebGPU target, used by some browser frameworks; it is separate from ONNX Runtime. "supporting GPU acceleration through ONNX Runtime [44] and MLC-LLM [46], respectively."
mxfp4: A 4-bit floating-point format (MXFP4) that can be emulated in software. "hardware-native formats like bf16, nvp4, and mxfp4."
NMSE: Numerical mean squared error; metric to validate GPU vs CPU outputs. "ensuring the numerical mean squared error (NMSE) across the output remains under a threshold"
ONNX Runtime: Inference engine for ONNX models with GPU acceleration paths. "supporting GPU acceleration through ONNX Runtime [44]"
OPFS: Origin Private File System, a browser API for per-origin persistent storage. "wllama includes features like caching models in the browser's Origin Private File System (OPFS) [47]"
QLoRA: Method for efficient fine-tuning/quantization that introduced double quantization. "double quantization technique introduced in QLoRA [11]."
q1_0: 1-bit symmetric quantization format with a single scale per 128-weight block. "a quantization format called q1_0 was introduced to support the Bonsai 1-bit model family"
q4_k_m: A specific K-quants variant used widely in llama.cpp models. "we specifically use q4_k_m model variants for all llama.cpp q4_k evaluation."
q8_0: Legacy 8-bit symmetric per-32-block quantization format. "Variants like q4_0 and q8_0 are symmetric"
rope (rotary position embedding): Positional encoding technique for transformers using rotations. "rotary position embedding (rope)"
shared memory: Fast on-chip memory shared among threads in a workgroup. "workgroups share fast on-chip memory (analogous to shared memory in CUDA)"
SIMT: Single Instruction, Multiple Threads; a GPU execution model for parallel threads. "general-purpose compute in a Single Instruction, Multiple Threads (SIMT) execution model"
subgroup matrix feature: Experimental WebGPU feature enabling cooperative matrix ops on specialized units. "WebGPU's newer subgroup matrix feature"
subgroups: Smaller thread collections (warp-like) enabling cooperative register-level operations. "Recent extensions to the WebGPU programming model include subgroups (analogous to warps in CUDA)"
tensor cores: Specialized hardware units for fast matrix multiplication on modern GPUs. "e.g., tensor cores"
top-k: Sampling strategy that restricts selection to the k highest-probability tokens. "sampling operations like top-k and argmax."
Vulkan: Cross-platform low-level GPU API used by WebGPU implementations/backends. "Vulkan on Linux and Android [20]"
WGSL: WebGPU Shading Language for writing compute/graphics shaders. "kernels (called shaders in WebGPU) are written in the WebGPU Shading Language (WGSL)"
Web Worker: Browser background worker thread used to avoid blocking the UI. "running WebGPU in a Web Worker to avoid freezing the UI thread."
WebAssembly (WASM): Portable low-level bytecode format for the web and native environments. "WebAssembly (WASM)"
WebGPU: Modern browser and native GPU API providing compute and graphics capabilities. "WebGPU [63] enable GPU acceleration within the browser"
wllama: Library for running llama.cpp in browsers, extended here with WebGPU support. "wllama includes features like caching models in the browser's Origin Private File System (OPFS) [47]"
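To make the FlashDecoding and KV-cache entries more concrete, here is a small sketch of the underlying arithmetic: attention for one query is computed per chunk of cached positions, and the partial results are merged by rescaling to a common maximum. It uses scalar values per position and runs the chunks sequentially, so it only illustrates why the split is exact; it is not the paper's WGSL kernel, and all function names are hypothetical.

```python
import math


def attend(scores, values):
    """Reference: softmax(scores)-weighted sum of values in a single pass."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attend_chunk(scores, values):
    """Partial result for one chunk: (local max, sum of exps, unnormalized sum)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))


def merge(partials):
    """Combine chunk partials by rescaling each one to the global maximum."""
    global_max = max(m for m, _, _ in partials)
    denom = sum(total * math.exp(m - global_max) for m, total, _ in partials)
    numer = sum(acc * math.exp(m - global_max) for m, _, acc in partials)
    return numer / denom


def attend_split(scores, values, chunk):
    """Split the cached positions into chunks, then merge the partial results."""
    partials = [
        attend_chunk(scores[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(scores), chunk)
    ]
    return merge(partials)


if __name__ == "__main__":
    scores = [0.1, 2.0, -1.0, 3.5, 0.7]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(attend(scores, values), attend_split(scores, values, chunk=2))
```

Because each chunk only needs its own maximum, exponent sum, and weighted sum, the chunks can in principle be handled by independent workgroups, with a final small reduction producing the same result as the single-pass version.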
