70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (2504.11651v1)

Published 15 Apr 2025 in cs.LG and cs.DC

Abstract: LLMs have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at https://github.com/LeanModels/DFloat11.

Summary

  • The paper introduces DFloat11, a lossless LLM compression framework that reduces model size by around 30% while preserving bitwise identical outputs to BF16.
  • It employs Huffman coding on the exponent bits of BF16 weights and a custom GPU kernel with multi-stage decoding to achieve efficient on-the-fly decompression.
  • Experiments demonstrate 1.85–38.83× throughput improvements and extended context lengths, making LLM inference more scalable and efficient on single GPUs.

LLMs are growing rapidly in size, making efficient deployment challenging, especially on hardware with limited memory. While lossy compression techniques like quantization reduce model size significantly, they can compromise accuracy and alter the model's output distribution, which is undesirable for applications requiring high fidelity. Existing lossless compression methods primarily focus on reducing storage size but do not offer benefits for inference on GPUs.

This paper introduces Dynamic-Length Float (DFloat11), a lossless compression framework designed to reduce LLM size by approximately 30% while ensuring bit-for-bit identical outputs compared to the original BFloat16 (BF16) models. The core idea is based on the observation that the BF16 representation of LLM weights is information-inefficient. Analysis shows that while the sign and mantissa components have entropy close to their bit width, the 8-bit exponent has significantly lower entropy (around 2.6 bits), indicating redundancy.
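
This low-entropy observation can be reproduced directly from a model's weights. The sketch below is a minimal illustration, assuming a PyTorch BFloat16 weight tensor is available; it measures the empirical entropy of each bit field and is not code from the paper.

    import torch

    def bfloat16_field_entropies(weight: torch.Tensor) -> dict:
        """Empirical entropy (bits) of the sign, exponent, and mantissa fields
        of a BFloat16 tensor, computed from the raw 16-bit patterns."""
        raw = weight.to(torch.bfloat16).contiguous().view(torch.int16)
        bits = raw.flatten().to(torch.int32) & 0xFFFF   # unsigned 16-bit patterns
        fields = {
            "sign":     (bits >> 15) & 0x1,    # 1 bit
            "exponent": (bits >> 7)  & 0xFF,   # 8 bits
            "mantissa": bits         & 0x7F,   # 7 bits
        }
        entropies = {}
        for name, values in fields.items():
            counts = torch.bincount(values)
            p = counts[counts > 0].float() / values.numel()
            entropies[name] = float(-(p * p.log2()).sum())
        return entropies

    # Hypothetical usage on one linear layer of a loaded model:
    # print(bfloat16_field_entropies(model.model.layers[0].mlp.up_proj.weight))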

DFloat11 leverages this inefficiency by applying entropy coding, specifically Huffman coding, to the exponents of the BF16 weights. Shorter codes are assigned to more frequent exponent values, achieving near information-optimal compression. The original sign and mantissa bits are kept uncompressed. The compressed exponents are stored in a tightly packed byte array (EncodedExponent), while the uncompressed sign and mantissa bits are stored in another byte array (PackedSignMantissa). This results in an effective bit width of around 11 bits per parameter, hence the name DFloat11.
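
To make the effective bit width concrete, the following sketch builds a standard Huffman code over the exponent frequencies and estimates the average bits per parameter (1 sign bit + 7 mantissa bits + the average coded exponent length). It is an illustration only; the paper's actual construction additionally caps code lengths at 32 bits and slightly adjusts the frequency distribution, which is not reproduced here.

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_code_lengths(freqs: dict) -> dict:
        """Code length (bits) per symbol for a standard Huffman code."""
        tie = count()                       # tie-breaker so heapq never compares dicts
        heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, next(tie), merged))
        return heap[0][2]

    # Hypothetical usage, reusing the exponent field from the previous sketch:
    # exponents = ((bits >> 7) & 0xFF).tolist()
    # freqs = Counter(exponents)
    # lengths = huffman_code_lengths(freqs)
    # avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / len(exponents)
    # print(f"effective bits per parameter ≈ {1 + 7 + avg_exp_bits:.2f}")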

The main challenge for practical application is performing efficient GPU inference with these variable-length encoded weights, as they must be decompressed on-the-fly back to BF16 format before matrix multiplications. Traditional sequential decoding is unsuitable for the massively parallel architecture of GPUs. To address this, the paper develops a custom GPU kernel for fast online decompression.

The GPU kernel design incorporates three key components:

  1. Efficient Decoding with Compact LUTs: To decode the variable-length Huffman codes in parallel, a lookup table (LUT) approach is used. A direct LUT for a maximum code length of 32 bits would be too large (2^32 entries). The paper proposes decomposing this into four smaller 2^8-entry LUTs (LUT_1, LUT_2, LUT_3, LUT_4) along with a CodeLengths table, all fitting within GPU SRAM for fast access. Decoding involves reading 4 bytes (32 bits) from the encoded stream and using successive bytes to index into LUT_1 through LUT_4 until a decoded exponent is found. The CodeLengths table provides the actual length of the decoded code to advance the bitstream. Ambiguities are resolved by slightly adjusting frequency distributions during Huffman tree construction. (A host-side sketch of this multi-stage lookup appears after this list.)
  2. Two-Phase Kernel and Lightweight Auxiliary Variables: To parallelize decoding across threads, each thread is assigned a fixed number of bytes from the encoded stream. However, determining the starting bit position and the output position for each thread is challenging due to variable code lengths. A small Gaps array (5 bits per thread) stores the starting bit offset for each thread's chunk. To avoid large overhead for storing output positions for every thread, output positions are stored only per thread block in a BlockOutputPos array. The decompression uses a two-phase kernel:
    • Phase 1: Threads decode their assigned bytes and count the number of elements they will produce, storing counts in SRAM. After thread synchronization within the block, thread-specific output positions are computed by summing the counts, starting from the block's output position.
    • Phase 2: Threads re-decode the same bytes (loaded into SRAM) and write the decompressed BF16 values to their calculated output positions in global memory. Algorithm 1 provides a pseudocode representation of this two-phase kernel:
      Procedure DFloatToBFloat():
          // ... (Load the thread's n-byte chunk of EncodedExponent and the LUTs into SRAM) ...

          // Phase 1: Count how many elements each thread will produce
          ForAll t in threads (in parallel):
              BitOffset = Gaps[block*T + t]
              NumElements[t] = 0
              While BitOffset < 8*n:
                  // Decode one exponent using the multi-LUT lookup
                  Exponent = DecodeExponent(EncodedExponent_chunk[t], BitOffset, LUTs, CodeLengths)
                  BitOffset += CodeLengths[Exponent]
                  NumElements[t] += 1

          Thread Synchronization Barrier

          // Compute thread output positions via prefix sums over the per-thread counts;
          // BlockOutputPos[block] is computed on the host before the kernel launch
          ForAll t in threads (in parallel):
              ThreadOutputPos[t] = BlockOutputPos[block] + sum(NumElements[i] for i in range(t))

              // Phase 2: Re-decode the same bytes (already in SRAM) and write the BFloat16 outputs
              BitOffset = Gaps[block*T + t]
              While BitOffset < 8*n:
                  Exponent = DecodeExponent(EncodedExponent_chunk[t], BitOffset, LUTs, CodeLengths)

                  // Construct a BFloat16 value from the stored sign/mantissa bits and the decoded exponent
                  BFloat16Value = ConstructBFloat16(PackedSignMantissa[ThreadOutputPos[t]], Exponent)

                  // Write to global memory
                  Outputs[ThreadOutputPos[t]] = BFloat16Value

                  BitOffset += CodeLengths[Exponent]
                  ThreadOutputPos[t] += 1
  3. Transformer-Block-Level Decompression: To improve GPU utilization and amortize decompression overhead, weights are not decompressed individually but batched together for all matrices within a transformer block. All weights for a block are decompressed before any computation is performed for that block.
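
As referenced in item 1 above, the compact-LUT lookup can be illustrated with a small host-side sketch. This is not the paper's kernel code: the four 256-entry tables are assumed to be given (their construction, including the frequency adjustments that remove ambiguity, is not shown), SENTINEL is an illustrative marker, and the bit ordering is one plausible convention.

    SENTINEL = 0xFFFF  # illustrative marker: "code is longer, consult the next table"

    def read_window_32(stream: bytes, bit_offset: int) -> int:
        """Return the 32 bits of `stream` starting at `bit_offset` (MSB-first)."""
        byte_pos, shift = divmod(bit_offset, 8)
        window = int.from_bytes(stream[byte_pos:byte_pos + 5].ljust(5, b"\0"), "big")
        return (window >> (8 - shift)) & 0xFFFFFFFF

    def decode_exponent(stream: bytes, bit_offset: int, luts, code_lengths):
        """Decode one exponent with up to four 256-entry table lookups.

        `luts` holds four 256-entry tables (LUT_1..LUT_4); an entry is either a
        decoded exponent or SENTINEL. `code_lengths[e]` is the Huffman code
        length of exponent e, used to advance the bitstream. On the GPU these
        tables live in SRAM; here they are plain Python lists.
        """
        word = read_window_32(stream, bit_offset)
        for i in range(4):
            byte_i = (word >> (24 - 8 * i)) & 0xFF      # i-th byte of the 32-bit window
            entry = luts[i][byte_i]
            if entry != SENTINEL:
                return entry, bit_offset + code_lengths[entry]
        raise ValueError("no table resolved the code; tables or stream are inconsistent")

A full decoder would simply loop this lookup over a thread's byte chunk, exactly as the Phase 1 and Phase 2 loops in Algorithm 1 do.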

Experiments on various recent LLMs (Llama-3.1, Qwen2.5, Gemma-3, etc.) show that DFloat11 consistently compresses models to around 70% of their original size (an effective ~11 bits per parameter) with no change in accuracy or perplexity on standard benchmarks such as MMLU, TruthfulQA, WikiText, and C4. Bit-level comparisons confirm the lossless nature of the compression.

Inference performance evaluations demonstrate that DFloat11-compressed models running on a single GPU significantly outperform uncompressed BF16 models that require CPU offloading due to memory constraints, achieving 1.85–38.83× higher token-generation throughput. When compared to BF16 models split across two GPUs, DF11 on a single GPU shows competitive or better performance depending on the model and batch size (Figure 10 in the Appendix). Furthermore, the memory savings from DF11 allow for a substantially larger KV cache, enabling 5.33–13.17× longer context lengths before running out of GPU memory.

An ablation study shows that while decompression adds overhead compared to BF16, this overhead is fixed per transformer block and becomes negligible relative to matrix multiplication time as the token batch size increases. Comparing the DF11 decompression kernel against CPU-to-GPU transfer and NVIDIA's nvCOMP library shows that DF11 decompression is significantly faster (up to 24.87× and 15.12×, respectively) and achieves a better compression ratio than nvCOMP.

DFloat11 provides a practical approach for deploying large LLMs on fewer GPUs and with longer context lengths by exploiting the redundancy in BF16 weights and designing a highly parallel, hardware-aware GPU decompression kernel. The implementation is integrated with popular inference frameworks such as Hugging Face Transformers, making it readily applicable. The code and models are publicly available.
