- The paper introduces DFloat11, a lossless LLM compression framework that reduces model size by around 30% while preserving bitwise identical outputs to BF16.
- It employs Huffman coding on the exponent bits of BF16 weights and a custom GPU kernel with multi-stage decoding to achieve efficient on-the-fly decompression.
- Experiments demonstrate 1.85–38.83× throughput improvements and extended context lengths, making LLM inference more scalable and efficient on single GPUs.
LLMs are growing rapidly in size, making efficient deployment challenging, especially on hardware with limited memory. While lossy compression techniques like quantization reduce model size significantly, they can compromise accuracy and alter the model's output distribution, which is undesirable for applications requiring high fidelity. Existing lossless compression methods primarily focus on reducing storage size but do not offer benefits for inference on GPUs.
This paper introduces Dynamic-Length Float (DFloat11), a lossless compression framework designed to reduce LLM size by approximately 30% while ensuring bit-for-bit identical outputs compared to the original BFloat16 (BF16) models. The core idea is based on the observation that the BF16 representation of LLM weights is information-inefficient. Analysis shows that while the sign and mantissa components have entropy close to their bit width, the 8-bit exponent has significantly lower entropy (around 2.6 bits), indicating redundancy.
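To make the entropy observation concrete, here is a small sketch (not from the paper) that estimates the empirical entropy of the sign, exponent, and mantissa fields of a BF16 weight tensor; the use of PyTorch and the helper name `field_entropy` are illustrative assumptions.

```python
import math
from collections import Counter

import torch

def field_entropy(weights: torch.Tensor) -> dict:
    """Estimate the empirical entropy (in bits) of each BF16 bit field."""
    assert weights.dtype == torch.bfloat16
    # Reinterpret the raw 16-bit patterns, then widen so bit operations are safe.
    bits = weights.flatten().view(torch.uint16).to(torch.int32)
    sign = (bits >> 15) & 0x1      # 1 sign bit
    exponent = (bits >> 7) & 0xFF  # 8 exponent bits
    mantissa = bits & 0x7F         # 7 mantissa bits

    def entropy(values: torch.Tensor) -> float:
        counts = Counter(values.tolist())
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    return {"sign": entropy(sign), "exponent": entropy(exponent), "mantissa": entropy(mantissa)}

# Hypothetical usage on any BF16 weight matrix of a loaded model:
# weights = model.model.layers[0].mlp.gate_proj.weight
# print(field_entropy(weights))
# Per the paper, the exponent entropy comes out around 2.6 bits, far below its
# 8-bit width, while sign and mantissa are close to their full bit widths.
```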
DFloat11 leverages this inefficiency by applying entropy coding, specifically Huffman coding, to the exponents of the BF16 weights. Shorter codes are assigned to more frequent exponent values, achieving near information-optimal compression. The original sign and mantissa bits are kept uncompressed. The compressed exponents are stored in a tightly packed byte array (EncodedExponent), while the uncompressed sign and mantissa bits are stored in another byte array (PackedSignMantissa). This results in an effective bit width of around 11 bits per parameter, hence the name DFloat11.
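As a rough illustration of how exponent coding yields an effective width of about 11 bits, the sketch below (assumed helper names, not the paper's packing code) derives Huffman code lengths from exponent frequency counts and adds the uncompressed 1 sign bit and 7 mantissa bits.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs: dict) -> dict:
    """Return Huffman code lengths (in bits) for each symbol, given its frequency."""
    # Heap items: (subtree frequency, tiebreaker, symbols contained in the subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, i2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:  # every merge adds one bit to each contained symbol
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, i2, syms1 + syms2))
    return lengths

def effective_bits_per_param(exponent_counts: Counter) -> float:
    """Average bits per weight: 1 sign + 7 mantissa + Huffman-coded exponent."""
    lengths = huffman_code_lengths(dict(exponent_counts))
    total = sum(exponent_counts.values())
    avg_exponent_bits = sum(exponent_counts[s] * lengths[s] for s in lengths) / total
    return 1 + 7 + avg_exponent_bits  # roughly 11 bits for the skewed distributions the paper reports

# Example with made-up counts: a highly skewed exponent distribution compresses well.
# print(effective_bits_per_param(Counter({126: 700, 125: 200, 124: 80, 120: 20})))
```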
The main challenge for practical application is performing efficient GPU inference with these variable-length encoded weights, as they must be decompressed on-the-fly back to BF16 format before matrix multiplications. Traditional sequential decoding is unsuitable for the massively parallel architecture of GPUs. To address this, the paper develops a custom GPU kernel for fast online decompression.
The GPU kernel design incorporates three key components:
- Efficient Decoding with Compact LUTs: To decode the variable-length Huffman codes in parallel, a lookup-table (LUT) approach is used. A direct LUT for a maximum code length of 32 bits would be far too large (2^32 entries). The paper instead decomposes it into four compact 2^8-entry LUTs (LUT1, LUT2, LUT3, LUT4) plus a CodeLengths table, all small enough to fit in GPU SRAM for fast access. Decoding reads 4 bytes (32 bits) from the encoded stream and uses successive bytes to index into LUT1 through LUT4 until a decoded exponent is found; the CodeLengths table then gives the length of the matched code so the bitstream position can be advanced. Ambiguities are avoided by slightly adjusting the frequency distribution during Huffman tree construction (a CPU reference sketch of this lookup appears after Algorithm 1 below).
- Two-Phase Kernel and Lightweight Auxiliary Variables: To parallelize decoding across threads, each thread is assigned a fixed number of bytes from the encoded stream. However, determining the starting bit position and the output position for each thread is challenging due to variable code lengths. A small Gaps array (5 bits per thread) stores the starting bit offset for each thread's chunk. To avoid large overhead for storing output positions for every thread, output positions are stored only per thread block in a BlockOutputPos array. The decompression uses a two-phase kernel:
- Phase 1: Threads decode their assigned bytes and count the number of elements they will produce, storing counts in SRAM. After thread synchronization within the block, thread-specific output positions are computed by summing the counts, starting from the block's output position.
- Phase 2: Threads re-decode the same bytes (loaded into SRAM) and write the decompressed BF16 values to their calculated output positions in global memory.
Algorithm 1 provides a pseudocode representation of this two-phase kernel:
    Procedure DFloatToBFloat():
        // Load this block's encoded bytes, the four LUTs, and CodeLengths into SRAM
        ForAll t in threads (in parallel):
            // Phase 1: count how many exponents this thread's n-byte chunk decodes to
            BitOffset = Gaps[block*T + t]          // starting bit offset within the chunk
            NumElements[t] = 0
            While BitOffset < 8*n:
                // Decode one exponent using the multi-LUT lookup
                Exponent = DecodeExponent(EncodedExponent_chunk[t], BitOffset, LUTs, CodeLengths)
                BitOffset += CodeLengths[Exponent]
                NumElements[t] += 1
            Thread Synchronization Barrier
            // Prefix-sum the per-thread counts, starting from the block's output
            // position (BlockOutputPos is computed on the host before kernel launch)
            ThreadOutputPos[t] = BlockOutputPos[block] + sum(NumElements[i] for i in range(t))
            // Phase 2: re-decode the same bytes (now in SRAM) and write BFloat16 outputs
            BitOffset = Gaps[block*T + t]
            While BitOffset < 8*n:
                Exponent = DecodeExponent(EncodedExponent_chunk[t], BitOffset, LUTs, CodeLengths)
                // Assemble the BFloat16 from the packed sign and mantissa bits plus the decoded exponent
                BFloat16Value = ConstructBFloat16(PackedSignMantissa[ThreadOutputPos[t]], Exponent)
                Outputs[ThreadOutputPos[t]] = BFloat16Value   // write to global memory
                BitOffset += CodeLengths[Exponent]
                ThreadOutputPos[t] += 1
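To make the DecodeExponent step in Algorithm 1 concrete, here is a CPU reference sketch (not the GPU kernel itself) of the multi-stage LUT lookup described earlier. The window extraction, the NOT_FOUND sentinel, and the table layout are assumptions for illustration; the scheme relies on the paper's adjusted Huffman construction so that a single byte of the window suffices to identify the code at each stage.

```python
NOT_FOUND = 0xFFFF  # hypothetical sentinel: "this byte does not resolve a code at this stage"

def decode_exponent(bitstream: bytes, bit_offset: int, luts, code_lengths):
    """CPU reference sketch of the multi-stage LUT lookup (DecodeExponent).

    `luts` is assumed to be a list of four 256-entry tables (LUT1..LUT4) whose
    entries are either a decoded 8-bit exponent or NOT_FOUND; `code_lengths[e]`
    is the Huffman code length of exponent value e. Returns (exponent, new_bit_offset).
    """
    # Gather a 32-bit window starting at bit_offset (MSB-first); five bytes always
    # cover a 32-bit window at an arbitrary bit offset.
    window = 0
    for i in range(5):
        byte_index = bit_offset // 8 + i
        window = (window << 8) | (bitstream[byte_index] if byte_index < len(bitstream) else 0)
    window = (window >> (8 - bit_offset % 8)) & 0xFFFFFFFF

    # Probe LUT1..LUT4 with successive bytes of the window until one resolves.
    # This works because the Huffman construction is adjusted so that codes longer
    # than 8*(k-1) bits share fixed leading bytes, making the k-th byte unambiguous.
    for stage in range(4):
        byte = (window >> (24 - 8 * stage)) & 0xFF
        exponent = luts[stage][byte]
        if exponent != NOT_FOUND:
            return exponent, bit_offset + code_lengths[exponent]
    raise ValueError("unreachable if the LUTs cover all codes up to 32 bits")
```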
- Transformer-Block-Level Decompression: To improve GPU utilization and amortize decompression overhead, weights are not decompressed individually but batched together for all matrices within a transformer block. All weights for a block are decompressed before any computation is performed for that block.
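The sketch below illustrates how this batching could be orchestrated at inference time; `decompress_fn`, the payload dictionaries, and the block call signature are placeholders, not the paper's actual API.

```python
def run_with_block_level_decompression(blocks, compressed_weights, hidden_states,
                                       decompress_fn, scratch=None):
    """Illustrative orchestration of transformer-block-level decompression.

    `compressed_weights[i]` maps matrix names to their DFloat11 payloads
    (EncodedExponent, PackedSignMantissa, Gaps, BlockOutputPos) for block i,
    and `decompress_fn(payload)` stands in for the custom GPU kernel that
    returns the reconstructed BF16 tensor. All names are placeholders.
    """
    scratch = {} if scratch is None else scratch
    for block, payloads in zip(blocks, compressed_weights):
        # Decompress every matrix of this block in one batch before any matmul,
        # so the fixed per-launch decompression cost is amortized over the whole
        # block's computation.
        for name, payload in payloads.items():
            scratch[name] = decompress_fn(payload)
        # Run the block's forward pass on the freshly decompressed weights;
        # the scratch buffers can then be reused for the next block.
        hidden_states = block(hidden_states, weights=scratch)
    return hidden_states
```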
Experiments on various recent LLMs (Llama-3.1, Qwen2.5, Gemma-3, etc.) show that models consistently compress to about 70% of their original size (an effective ~11 bits per parameter) with no change in accuracy or perplexity on standard benchmarks such as MMLU, TruthfulQA, WikiText, and C4. Bit-level comparisons confirm that the compression is lossless.
Inference performance evaluations demonstrate that DFloat11-compressed models running on a single GPU significantly outperform uncompressed BF16 models that require CPU offloading due to memory constraints, achieving 1.85–38.83× better latency or throughput. When compared to BF16 models split across two GPUs, DF11 on a single GPU shows competitive or better performance depending on the model and batch size (Figure 10 in Appendix). Furthermore, the memory savings from DF11 allow for a substantially larger KV cache, enabling 5.33–13.17× longer context lengths before running out of GPU memory.
An ablation study shows that while decompression adds overhead compared to plain BF16 inference, this overhead is fixed per transformer block and becomes negligible relative to matrix-multiplication time as the token batch size increases. Comparing the DF11 decompression kernel against CPU-to-GPU weight transfer and NVIDIA's nvCOMP library shows that DF11 decompression is significantly faster (up to 24.87× and 15.12×, respectively) and achieves a better compression ratio than nvCOMP.
DFloat11 offers a practical way to deploy large LLMs on fewer GPUs and with longer context lengths by exploiting the redundancy in BF16 weights and pairing it with a highly parallel, hardware-aware GPU decompression kernel. The implementation is integrated with popular inference frameworks such as Hugging Face Transformers, making it readily applicable, and the code and models are publicly available.