
GGUF Format: Unified Quantized Model File

Updated 13 December 2025
  • GGUF Format is a single-file, memory-mapped binary designed for efficient packaging and deployment of post-training quantized large language models.
  • It consolidates model weights, quantization parameters, metadata, and configuration details, utilizing blockwise k-quant schemes to minimize reconstruction loss.
  • The format enables rapid lazy-loading and optimized alignment, offering significant compression and practical mobile inference without separate files.

GGUF (GGerganov’s Unified Format) is a memory-mapped, single-file binary format designed for efficient packaging, distribution, and deployment of post-training–quantized LLMs, particularly in resource-constrained and mobile execution environments. Originating as the native format for the llama.cpp and Ollama frameworks, GGUF consolidates model weights, quantization parameters, and all relevant model metadata—including tokenizer and configuration details—within a single file, optimized for both inference efficiency and portability. The design accommodates state-of-the-art blockwise quantization schemes, supports explicit sub-byte types (down to 2–4 bit precision), and enables rapid lazy-loading on a spectrum of device platforms (Yadav et al., 6 Dec 2025, Egashira et al., 24 May 2025).

1. Logical Structure and File Organization

A GGUF file consists of the following major sections in tightly specified binary order (Yadav et al., 6 Dec 2025, Egashira et al., 24 May 2025):

  • Header: Identified by a four-byte ASCII magic string (“GGUF”), followed by a 32-bit version integer.
  • Global Metadata: A sequence of length-prefixed key-value pairs, covering model architecture (e.g., “n_embd”, “n_layer”, “n_head”), tokenizer, quantization type (e.g., “q4_k_m”), and checksums (e.g., SHA256, CRC32). Endianness for all multi-byte fields is little-endian.
  • Tensor Blocks: Each tensor is described by a metadata header (name, shape, quantization type, block sizes, disk offsets) and followed by its quantized weights (blockwise packed), quantization tables, and associated scaling/folding constants. All tensor metadata is padded to 8 bytes; tensor data to 16 bytes to assure optimal mmap and cache-line alignment.
  • Optional Sections: User-defined blobs and per-file integrity fields.

This single-file, memory-mappable approach stands in contrast to conventional PyTorch-based deployables, which require separate files for weights, configuration, and vocabularies (Yadav et al., 6 Dec 2025).

| Section | Description | Padding/Alignment |
|---|---|---|
| Header | Magic string, version | |
| Metadata | Key-value fields (model/tokenizer/quantization details) | 8 bytes |
| Tensor Blocks | Quantized weights, scale/min tables, parameters | 16 bytes (data), 8 bytes (metadata) |
| Optional Blobs | User fields, checksums | 8–16 bytes as needed |
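
As a concrete illustration, the fixed-size header fields above can be inspected with a short script. This is a minimal sketch, assuming a little-endian GGUF v2/v3 file in which the header also carries 64-bit tensor and key-value counts; the file name is borrowed from the example in Section 3, and full key-value parsing is omitted.

# Minimal sketch: inspect the fixed-size GGUF header fields described above.
# Assumes a little-endian GGUF v2/v3 file; key-value parsing is omitted.
import struct

def read_gguf_header(path: str) -> dict:
    with open(path, "rb") as f:
        magic = f.read(4)                              # four-byte ASCII magic string
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))      # 32-bit version integer
        n_tensors, = struct.unpack("<Q", f.read(8))    # 64-bit tensor count
        n_kv, = struct.unpack("<Q", f.read(8))         # 64-bit metadata key-value count
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

print(read_gguf_header("llama-3b-q4_k_m.gguf"))  # illustrative file name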

2. Quantization Scheme Support and Algorithms

GGUF supports multiple quantization schemes, with particular specialization for blockwise “k-quant” families targeting 2–6 bit per parameter granularity (Egashira et al., 24 May 2025). The principal types include:

  • k-quant (Weight-Optimized): Q2_K, Q3_K_S/M/L, Q4_K_S/M, Q5_K_S/M, Q6_K. The S, M, and L suffixes designate the fraction of network layers mapped to higher precision.
  • Double Quantization: Each weight block (typically 256 parameters, arranged as m subblocks of n dimensions) is approximated via a local affine transform using per-subblock scale and offset, further quantized to reduce storage overhead.
  • Zero/One/i-quants: Also supported, but less prevalent in practice.

For k-quant, the procedure is:

  1. Partition each tensor into superblocks $X \in \mathbb{R}^{m \times n}$ with $mn = 256$.
  2. Compute optimal Scale[i] and Min[i] via grid/regression to minimize the $L_2$ reconstruction loss (Algorithm 2 in (Egashira et al., 24 May 2025)).
  3. Quantize each entry:

$$Q_{i,j} = \operatorname{round}\!\left(\frac{X_{i,j} - \mathrm{Min}[i]}{\mathrm{Scale}[i]}\right)$$

  4. Double-quantize the scales/mins into small integer tables ($Q_{\mathrm{scales}}$, $Q_{\mathrm{mins}}$) plus global floats ($d_{\mathrm{scales}}$, $d_{\mathrm{mins}}$).
  5. Store $Q_{\mathrm{scales}}$, $Q_{\mathrm{mins}}$, and $Q$ as tightly packed $N$-bit unsigned integers (a code sketch of steps 1–3 follows this list).
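
The per-sub-block affine fit in steps 1–3 can be sketched as follows. This is a simplified illustration: it uses a plain min/max fit per sub-block rather than the grid/regression search of Algorithm 2 in (Egashira et al., 24 May 2025), and the 8×32 block shape and 4-bit width are assumptions made for the example.

# Simplified sketch of blockwise affine quantization (steps 1-3 above); uses a
# plain min/max fit per sub-block instead of llama.cpp's grid/regression search.
import numpy as np

def quantize_superblock(X: np.ndarray, bits: int = 4):
    """X: (m, n) superblock with m*n == 256; returns codes, scales, mins."""
    qmax = (1 << bits) - 1
    mins = X.min(axis=1)                           # Min[i] per sub-block
    scales = (X.max(axis=1) - mins) / qmax         # Scale[i] per sub-block
    scales = np.where(scales == 0, 1.0, scales)    # guard against flat sub-blocks
    Q = np.rint((X - mins[:, None]) / scales[:, None]).clip(0, qmax).astype(np.uint8)
    return Q, scales, mins

def dequantize_superblock(Q, scales, mins):
    return Q * scales[:, None] + mins[:, None]

# Example: 8 sub-blocks of 32 weights each (8 * 32 = 256 parameters)
X = np.random.randn(8, 32).astype(np.float32)
Q, s, m = quantize_superblock(X, bits=4)
rmse = np.sqrt(np.mean((X - dequantize_superblock(Q, s, m)) ** 2))
print(f"4-bit blockwise reconstruction RMSE: {rmse:.4f}")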

Blockwise quantization reduces root-mean-square error 2–4× relative to naive (unstructured) rounding, enabling sub-5 GB LLMs to retain high functional accuracy (Egashira et al., 24 May 2025).

3. Conversion Workflow and Command-Line Processes

The canonical workflow for producing and using a 4-bit GGUF model with llama.cpp is as follows (Yadav et al., 6 Dec 2025):

  • Environment preparation: Compile llama.cpp and necessary scripts.
  • PTQ Execution: Perform Post-Training Quantization (PTQ) using BitsAndBytes within the Hugging Face Transformers framework to generate a 4-bit weight checkpoint.
  • Format Conversion and Quantization:

python3 tools/convert-hf-to-llama.py --hf-model llama-3b-4bit-nf4 --output llama-3b-4bit-nf4.raw
python3 tools/quantize.py llama-3b-4bit-nf4.raw llama-3b-q4_k_m.gguf q4_k_m

  • Model Packaging:

zip llama-3b-gguf-q4km.zip llama-3b-q4_k_m.gguf

  • Mobile Deployment: Transfer the ZIP archive to the device (via adb or cloud storage), then unpack it and compile llama.cpp under Termux.
  • Inference:

./main -m /path/to/llama-3b-q4_k_m.gguf -p "prompt text" -n 64

  • Serving: Optional hosting via Ollama CLI for networked inference sessions.

This pipeline enables a quantized model (e.g., 1.88 GB for Llama 3.2 3B at q4_k_m, a 68.66% reduction from BF16) to run inference on Android hardware without further modification (Yadav et al., 6 Dec 2025).
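
As an optional sanity check after deployment (not part of the paper's pipeline), the packaged file can also be loaded from Python through the llama-cpp-python bindings; the model path, context size, and prompt below are illustrative.

# Optional sanity check of the packaged GGUF artifact via llama-cpp-python
# (not part of the workflow above); path and parameters are illustrative.
from llama_cpp import Llama

llm = Llama(model_path="llama-3b-q4_k_m.gguf", n_ctx=2048)
out = llm("Summarize the GGUF file format in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])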

4. Empirical Performance, Accuracy, and Metrics

Empirical results for Llama 3.2 3B quantized to GGUF (q4_k_m) demonstrate the following (Yadav et al., 6 Dec 2025):

  • Compression Ratios: Original BF16: 6.00 GB; BitsAndBytes 4-bit (nf4): 2.10 GB (64.92% reduction); GGUF q4_k_m: 1.88 GB (68.66% reduction).
  • Accuracy: MMLU score drops modestly from 64.2% (original) to 61.8% (GGUF q4_k_m).
  • Perplexity/BLEU: WikiText-2: 8.57 ± 0.06; DailyMail BLEU: 0.45.
  • Mobile Usability: On a Snapdragon 750G (OnePlus Nord CE), the model loads in under 10 seconds and generates 4–6 tokens/second on a single core (not systematically benchmarked).
  • Qualitative Usability: The quantized model remains “usable for non-real-time tasks” on mobile (Yadav et al., 6 Dec 2025).
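
The reduction percentages follow directly from the reported file sizes; the short calculation below reproduces them, with small deviations attributable to the sizes being rounded to two decimal places here.

# Reproduce the compression percentages from the reported file sizes.
# Deviations from the published 64.92% / 68.66% stem from rounded inputs.
original_gb = 6.00  # BF16 baseline
for name, size_gb in [("BitsAndBytes 4-bit (nf4)", 2.10), ("GGUF q4_k_m", 1.88)]:
    reduction = 100.0 * (original_gb - size_gb) / original_gb
    print(f"{name}: {size_gb:.2f} GB ({reduction:.2f}% reduction)")
# -> roughly 65.00% and 68.67% with these rounded sizes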

5. On-Disk Layout and Loader Pseudocode

The on-disk representation in GGUF is strictly specified to support mmap and direct, alignment-efficient parsing (Egashira et al., 24 May 2025):

  • Header: 4-byte “GGUF” + 4-byte version.
  • Global Metadata: Length-prefixed; includes model/tok/quant fields.
  • Per-Tensor:
    • Name, dim vector, quant type (“Qx_K”), block sizes, data offsets/lengths, all padded/aligned.
    • Data region: floats (dscales, dmins), quant tables (Qscales, Qmins), quantized weights Q, fully packed.
  • Alignment: Metadata (8B), data (16B) for SIMD/blockload efficiency.

Minimal loading pseudocode (Yadav et al., 6 Dec 2025):

// Pseudocode for reading a GGUF model
GGUFModel load_gguf(File file) {
  GGUFModel model;
  Header hdr = read_header(file);              // 4-byte magic "GGUF" + version
  for (i = 0; i < hdr.n_fields; ++i) {         // global key-value metadata
    KeyValue kv = read_key_value(file);
    model.store_metadata(kv.key, kv.value);
  }
  // Memory-map each tensor's quantized data region, one by one
  for (tensor_meta in model.metadata.tensor_list) {
    Block W = mmap_weights(file, tensor_meta.offset, tensor_meta.length);
    model.weights.push_back(W);
  }
  return model;
}

At inference, dequantization is performed as:

$$\widehat{X}_{i,j} = d_{\mathrm{scales}}\, Q_{\mathrm{scales}}[i]\, Q_{i,j} + d_{\mathrm{mins}}\, Q_{\mathrm{mins}}[i]$$

for each (i,j) in the superblock (Egashira et al., 24 May 2025).
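
As a minimal sketch of this dequantization step, assuming a single superblock with per-sub-block integer tables and global scale factors as in Section 2 (shapes and dtypes below are illustrative):

# Sketch of superblock dequantization using the formula above; names mirror
# Section 2. Assumed shapes: Q is (m, n); Qscales, Qmins are length-m tables.
import numpy as np

def dequantize_with_tables(Q, Qscales, Qmins, dscales, dmins):
    scales = dscales * Qscales.astype(np.float32)   # reconstruct Scale[i]
    mins = dmins * Qmins.astype(np.float32)         # reconstruct Min[i]
    return Q.astype(np.float32) * scales[:, None] + mins[:, None]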

6. Limitations, Security, and Considerations

  • Quantization Artifacts: There is a small, quantifiable drop in metrics (e.g., MMLU) and possible minor task-specific degradation (Yadav et al., 6 Dec 2025).
  • Security: Complex blockwise schemes do not, by themselves, preclude adversarial weight injection. Tailored attacks exploiting the quantization error ($\Delta w$) can still embed malicious behaviors invisible in the original weights, with documented effectiveness on insecure code generation ($\Delta = 88.7\%$ success), targeted content injection (85.0%), and benign instruction refusal (30.1%) (Egashira et al., 24 May 2025).
  • Platform Constraints: GGUF is optimized for memory-mapping; loading very large models on constrained RAM can lead to swapping or failure if memory is critically low.
  • Usability: Current practical usage on Android is typically via Termux or CLI wrappers. A native Android loader would offer superior ergonomics.
  • Benchmark Coverage: Reported inference speed is qualitative only; exact tokens/sec for chip varieties and multi-core scaling is not systematically published (Yadav et al., 6 Dec 2025).
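
Given the integrity fields mentioned in Section 1, one operational mitigation is to verify a distributed file's digest against a trusted reference before loading; the sketch below assumes a published SHA-256 digest is available out of band, and the file name is illustrative. This guards against tampering in distribution, not against the quantization-aware attacks described above, so behavioral validation of the quantized artifact remains advisable.

# Verify a GGUF file's SHA-256 against a digest obtained from a trusted channel
# (placeholder value below); file name is illustrative.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "<published digest>"  # placeholder: obtain from a trusted source
actual = sha256_of("llama-3b-q4_k_m.gguf")
print("OK" if actual == expected else f"MISMATCH: {actual}")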

7. Context and Significance

GGUF has become the de facto standard for quantized LLM distribution within the llama.cpp and Ollama ecosystems, enabling efficient mobile and edge inference with built-in quantization types suitable for modern transformer models. Its blend of single-file portability, native quantization support, and explicit metadata for all critical model parameters offers compelling trade-offs in model size, accuracy, and deployment simplicity. Research has identified that GGUF’s quantization approach—though advanced in error minimization and memory layout—cannot be relied upon solely for defense against adversarial modification, and model consumers should perform appropriate behavioral validation on quantized artifacts (Egashira et al., 24 May 2025, Yadav et al., 6 Dec 2025). A plausible implication is that as GGUF becomes more widespread in edge and consumer devices, both standardization around format specifications and the development of robust integrity verification and threat detection methods will become increasingly essential.
