
StudentFloat-4: 4-Bit t-Distribution Quantization

Updated 19 May 2026
  • StudentFloat-4 (SF4) is a 4-bit lookup-based quantization format that uses a Student’s t-distribution with ν≈5 to better model weight and activation tensors in deep neural networks.
  • It builds its 16-entry codebook by partitioning the cumulative probability mass of the t-distribution, which improves representational accuracy, particularly for large language models.
  • While SF4 offers up to 0.9 percentage point accuracy gains over methods like NF4, its hardware implementation introduces significant area and power overhead compared to simpler INT4 schemes.

StudentFloat-4 (SF4) is a 4-bit lookup-based quantization format designed to efficiently approximate the weight and activation distributions encountered in contemporary LLMs and other deep neural architectures. Unlike earlier quantization schemes that assume Gaussian statistics, SF4 is derived from the empirical observation that these tensors are better modeled by heavy-tailed, centrally peaked Student’s t-distributions. By constructing a quantization grid based on this distributional form (specifically, with parameter $\nu \approx 5$), SF4 improves model accuracy over previous formats such as Normal Float (NF4) and INT4, while introducing new tradeoffs in hardware complexity and implementability (Dotzel et al., 2024).

1. Distributional Motivation and Theoretical Foundation

Profiling the weight and activation tensors of thirty modern models (including LLMs, BERT variants, ResNets, and ViTs) reveals a consistent mismatch with the Gaussian assumption: empirical distributions typically exhibit sharper central peaks and substantially heavier tails, a property effectively captured by the zero-mean Student’s t-distribution. The t-distribution, parameterized by its degrees of freedom $\nu$, has the density:

$$f(x;\,\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$

The parameter $\nu$ controls tail weight: smaller values yield heavier tails ($\nu = 1$ recovers the Cauchy distribution), while $\nu \rightarrow \infty$ approaches the standard normal. Model profiling via Kolmogorov–Smirnov distance minimization shows that the best-fit $\nu$ typically ranges from 3 to 8 for weights and is often below 5 for activations. This motivates the design of a quantizer that closely tracks the true heavy-tailed tensor distribution to preserve model accuracy under low-bitwidth representations.
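As a small illustration of the tail behavior described above, the density can be evaluated directly with the Python standard library. This is a sketch rather than code from the paper; the function names `t_pdf` and `normal_pdf` are chosen here for illustration.

```python
import math


def t_pdf(x: float, nu: float) -> float:
    """Density of the zero-mean, unit-scale Student's t-distribution."""
    # Work in log space so that large nu does not overflow the gamma function.
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def normal_pdf(x: float) -> float:
    """Density of the standard normal distribution, for comparison."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
```

Evaluating `t_pdf(4.0, 5)` against `normal_pdf(4.0)` shows the heavier tail at $\nu = 5$, and `t_pdf(x, 1)` reduces to the Cauchy density $1/(\pi(1+x^2))$.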

2. Quantization Grid: Derivation and Construction

SF4's codebook of $2^4 = 16$ values is defined by equally partitioning the cumulative probability mass of the standard t-distribution. The process, as formalized in Algorithm 1 of (Dotzel et al., 2024), uses a small tail-truncation parameter $\delta \approx \frac{1}{2}(1/32+1/30)$. The cumulative-probability cut points are:

  • $p_i = \delta + (i-1)\cdot\frac{1/2 - \delta}{7}$ for $i = 1, \dots, 8$,
  • $p_i = \frac{1}{2} + (i-8)\cdot\frac{1/2 - \delta}{8}$ for $i = 9, \dots, 16$.

Quantized codebook entries are $\tilde{q}_i = Q_t(p_i;\,\nu)$, where $Q_t$ is the quantile function (inverse CDF) of the t-distribution. Codebook normalization enforces $\max_i |q_i| = 1$ by setting $q_i = \tilde{q}_i / \max_j |\tilde{q}_j|$, so that $q_1 = -1$ and $q_{16} = 1$. For SF4, a fixed $\nu = 5$ is recommended to match typical LLM tensor statistics, enabling precomputed universal codebooks.
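The construction above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1 verbatim: the upper-half cut points are assumed to mirror the NF4-style spacing (eight steps from 1/2 to $1-\delta$), the quantile function is obtained by Simpson integration plus bisection instead of a library routine, and all function names (`t_cdf`, `t_quantile`, `sf4_codebook`) are hypothetical.

```python
import math


def t_pdf(x, nu):
    """Student's t density with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, steps=400):
    """CDF via composite Simpson integration of the density over [0, |x|]."""
    if x == 0.0:
        return 0.5
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    half = total * h / 3
    return 0.5 + half if x > 0 else 0.5 - half


def t_quantile(p, nu, tol=1e-10):
    """Invert the CDF by bisection on a fixed bracket."""
    if abs(p - 0.5) < 1e-12:
        return 0.0  # the median is exactly zero by symmetry
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sf4_codebook(nu=5.0):
    """16 normalized t-quantiles; upper-half spacing is an assumption."""
    delta = 0.5 * (1 / 32 + 1 / 30)
    lower = [delta + (i - 1) * (0.5 - delta) / 7 for i in range(1, 9)]
    upper = [0.5 + (i - 8) * (0.5 - delta) / 8 for i in range(9, 17)]
    raw = [t_quantile(p, nu) for p in lower + upper]
    peak = max(abs(q) for q in raw)
    return [q / peak for q in raw]
```

Calling `sf4_codebook()` returns a sorted list of 16 values in $[-1, 1]$ that contains zero. With SciPy available, `scipy.stats.t.ppf` could replace the hand-rolled `t_quantile`.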

Quantization is performed per block of 128 tensor elements. Each value is scaled by the maximum absolute entry in its block ($s = \max_j |x_j|$), clipped to $[-1, 1]$, and mapped to the nearest codebook entry. Elements outside this interval are clamped to the boundary codepoints, treating them as supernormals.

3. Format Definition, Implementation, and Comparative Properties

SF4 is a pure 4-bit lookup format without explicit exponent/mantissa structure, in contrast to conventional floating-point types. The 4 bits index one of the 16 t-quantiles. SF4 closely resembles NF4 (Normal Float), differing only in the selection of grid points (t-distribution vs. standard-normal quantiles), and shares the same blockwise scaling and symmetric clipping mechanics.

Key contrasts include:

| Format | Grid Construction | Range Assignment | Exponent/Mantissa |
|--------|-------------------|------------------|-------------------|
| SF4 | Student’s t quantiles ($\nu = 5$) | $[-1, 1]$ blockwise, lookup | None (4-bit index) |
| NF4 | Gaussian quantiles | $[-1, 1]$ blockwise, lookup | None (4-bit index) |
| INT4 | Uniform int ([…]) | Linear, no lookup | None |
| E2M1 | Floating-point, 2 exponent bits and 1 mantissa bit | Exponent/mantissa | Yes |

During inference, dequantization entails a codebook lookup using the stored code index, with the result rescaled by the corresponding block scale $s$.

4. Empirical Performance and Accuracy Tradeoffs

SF4 demonstrates improved representational accuracy relative to prior formats under both weight-only and weight-and-activation (W4A4) quantization:

  • On LLaMA2-7B (average zero-shot accuracy, six tasks):
    • FP32: […] drop
    • INT4: […]
    • NF4: […]
    • SF4: […]
    • E2M1: […]
    • E2M1+super-precision: […]

SF4 delivers up to a 0.9 percentage point gain over NF4, with a reported […] average absolute improvement on LLaMA2-7B tasks. SF4 often matches or exceeds the performance of 5-bit integer quantization.

For W4A4 quantization, SF4 leads the lookup-float variants: on Mistral-7B with SmoothQuant, the accuracy drop is […] for SF4 versus […] for NF4. Across diverse models (Mistral-7B, OPT-1B, Phi-2, BLOOM-7B, Yi-6B), SF4 consistently narrows the quantization accuracy gap.

5. Hardware Cost and Implementation Implications

While SF4 achieves favorable accuracy, its hardware realization introduces significant area and power overheads relative to integer and hybrid floating-point schemes. In a 28 nm process, a single SF4 multiply-accumulate (MAC) unit requires 121.5 $\mu\text{m}^2$ (multiplier) plus 96.5 $\mu\text{m}^2$ (19-bit accumulator), totaling 218.0 $\mu\text{m}^2$ and dissipating approximately […]W.

Comparative area requirements for MACs:

| Format | Multiplier Area ($\mu\text{m}^2$) | Accumulator Area ($\mu\text{m}^2$) | Total MAC ($\mu\text{m}^2$) |
|--------|------|------|-------|
| INT4 | 86.8 | 73.9 | 160.7 |
| E2M1 | 110.9 | 59.5 | 170.4 |
| SF4 | 121.5 | 96.5 | 218.0 |

At a system level (assuming chip area splits into 10% MAC, 60% memory, and 30% control), supporting SF4 natively would increase chip area by […] compared to INT4. Efficient alternatives, such as E2M1 with super-precision (one additional codepoint), require only […] area overhead while recouping nearly all of the 4-bit accuracy loss. As a result, while SF4 serves as a high-accuracy reference, it is considered impractical for deployment in current 4-bit ASIC designs.
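The system-level estimate can be approximated with a back-of-the-envelope calculation from the MAC totals in the table and the 10% MAC share. The sketch below assumes that memory and control area stay fixed and that only the MAC portion scales with the per-unit MAC area. That is a simplification, so the result is illustrative and need not match the figure reported in the paper. The names `MAC_AREA` and `chip_area_overhead` are hypothetical.

```python
# Total MAC area per format, taken from the comparison table above.
MAC_AREA = {"INT4": 160.7, "E2M1": 170.4, "SF4": 218.0}

# Fraction of chip area occupied by MAC units (the stated 10% assumption).
MAC_SHARE = 0.10


def chip_area_overhead(fmt, baseline="INT4", mac_share=MAC_SHARE):
    """Relative chip-area increase if only the MAC portion grows."""
    mac_growth = MAC_AREA[fmt] / MAC_AREA[baseline] - 1.0
    return mac_share * mac_growth
```

Under these assumptions, `chip_area_overhead("SF4")` evaluates to roughly 0.036, that is, an increase of about 3.6% over an INT4 design, because the SF4 MAC is about 36% larger and MACs make up a tenth of the chip.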

6. Practical Algorithm and Workflow

Weight-only quantization using SF4 proceeds as follows:

  1. Partition each weight matrix into blocks of 128 columns.
  2. For each block, compute the scale $s = \max_j |w_j|$ (using FP16 or FP32).
  3. Normalize: $\hat{w}_j = w_j / s$, then clamp to $[-1, 1]$.
  4. For each $\hat{w}_j$, find $c_j = \arg\min_i |\hat{w}_j - q_i|$ and store $c_j$ in 4 bits.
  5. Save $s$ alongside the 4-bit codeword array.

Inference reverses quantization: each code index is mapped back via the codebook and multiplied by the block scale $s$.
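The workflow above can be sketched as a quantize/dequantize round trip. This is an illustrative sketch, not the reference implementation: it treats the tensor as a flat list split into consecutive blocks of 128 elements (not matrix columns), keeps code indices as plain integers without packing two per byte, and accepts any sorted 16-entry codebook in $[-1, 1]$. The names `quantize_blocks` and `dequantize_blocks` are hypothetical.

```python
from typing import List, Sequence, Tuple

BLOCK = 128  # elements per block, as in the workflow above


def quantize_blocks(weights: Sequence[float], codebook: Sequence[float],
                    block: int = BLOCK) -> Tuple[List[int], List[float]]:
    """Return (code indices, per-block scales) for a flat weight list."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        # Scale by the largest magnitude; fall back to 1.0 for all-zero blocks.
        s = max(abs(w) for w in chunk) or 1.0
        scales.append(s)
        for w in chunk:
            x = max(-1.0, min(1.0, w / s))  # normalize, then clamp
            # Nearest codebook entry by absolute distance.
            codes.append(min(range(len(codebook)),
                             key=lambda i: abs(x - codebook[i])))
    return codes, scales


def dequantize_blocks(codes: Sequence[int], scales: Sequence[float],
                      codebook: Sequence[float],
                      block: int = BLOCK) -> List[float]:
    """Map each code back through the codebook and rescale by its block."""
    return [codebook[c] * scales[k // block] for k, c in enumerate(codes)]
```

Each reconstructed value differs from the original by at most the block scale times half the largest gap between adjacent codebook entries, which is why a grid that places more entries where the values concentrate reduces the average error.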

7. Significance, Limitations, and Pareto Analysis

SF4’s principal contribution is the demonstration that a quantizer grid derived from a Student’s t-distribution with $\nu \approx 5$ more faithfully matches actual LLM tensor statistics than Gaussian-based grids. This distribution-matched codebook yields consistent accuracy improvements in both W4 and W4A4 quantization settings. However, the area and power costs of implementing full SF4 MACs render the scheme suboptimal in ASIC contexts demanding minimal overhead.

Pareto analysis—plotting average accuracy loss against system-level chip area—reveals that INT4, E2M1, and E2M1 augmented with super-precision support define the practical efficiency frontier. The insight of matching quantizer grids to heavy-tailed statistics can be transferred to these hybrid formats, enabling nearly all of SF4’s accuracy gains at vastly lower hardware cost. The work provides both an upper-bound reference for 4-bit float quantization accuracy and a roadmap toward near-optimal, cost-effective dtype design (Dotzel et al., 2024).

