ConvLinear4bit: 4-bit Quantization Methods

Updated 10 December 2025
  • ConvLinear4bit is a quantization approach that compresses convolution and linear layers to 4 bits, balancing low-bit representation with minimal accuracy degradation.
  • It integrates quantization-aware training, plug-and-play reparameterizations, and residual correction to optimize model performance and support efficient hardware implementations.
  • The method achieves significant size reductions and speedups, as evidenced by empirical evaluations across ASR, image classification, diffusion models, and LLMs.

ConvLinear4bit refers to a family of methods, modules, and micro-kernels for designing, quantizing, and deploying convolutional and linear layers where both weights and activations are quantized to 4 bits. This approach has emerged as the leading practical instantiation of ultra-low-bitwidth inference in deep networks, balancing aggressive model compression with minimal accuracy loss. ConvLinear4bit implementations now appear in post-training quantization (PTQ), quantization-aware training (QAT), architecture search with residual correction, plug-and-play rotation/frequency-domain smoothing, and hardware-adapted kernels for CPUs, GPUs, FPGAs, and custom processors.

1. Theoretical Foundations of 4-bit Quantization

ConvLinear4bit quantizes model parameters and intermediate activations to 4 bits, typically using uniform symmetric or affine quantizers. The most widely adopted mathematical formulation uses a symmetric, uniform $k$-bit quantizer ($k = 4$) with scale $s$: $Q(x) = s \cdot \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right),\; -2^{k-1},\; 2^{k-1}-1\right)$, where $s$ is computed either per-tensor or per-channel as $s = \max |x| / (2^{k-1} - 1)$. For activations, unsigned quantization is often used, with $s_a = \max x / (2^{k} - 1)$ and output range $[0, 2^{k}-1]$ (Ding et al., 2022).
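
A minimal NumPy sketch of these quantizers (function names are illustrative, not taken from any cited implementation):

```python
import numpy as np

def quantize_symmetric(x, k=4):
    """Symmetric uniform k-bit quantization: returns integer codes and the scale s."""
    qmax = 2 ** (k - 1) - 1                              # 7 for k = 4
    s = max(np.max(np.abs(x)), 1e-12) / qmax             # per-tensor scale
    q = np.clip(np.round(x / s), -2 ** (k - 1), qmax).astype(np.int8)
    return q, s

def dequantize(q, s):
    """Map integer codes back to real values: Q(x) = s * q."""
    return s * q.astype(np.float32)

def quantize_unsigned(x, k=4):
    """Unsigned variant often used for post-ReLU activations, output range [0, 2^k - 1]."""
    qmax = 2 ** k - 1                                    # 15 for k = 4
    s = max(np.max(x), 1e-12) / qmax
    q = np.clip(np.round(x / s), 0, qmax).astype(np.uint8)
    return q, s
```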

Multiple axis granularities are supported:

  • Per-channel (e.g., one scale per output channel of a weight tensor)
  • Per-tensor or per-block (a single scale for the whole matrix, or one per predefined group)
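
For example, the per-channel case can be sketched as follows, assuming the symmetric quantizer above; the helper name and channel-axis convention are illustrative:

```python
import numpy as np

def per_channel_quantize(w, k=4, axis=0):
    """Symmetric k-bit quantization with one scale per output channel (axis 0 by convention)."""
    qmax = 2 ** (k - 1) - 1
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    s = np.maximum(np.abs(w).max(axis=reduce_axes, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(w / s), -(qmax + 1), qmax).astype(np.int8)
    return q, s   # dequantize as s * q; s broadcasts along the channel axis
```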

State-of-the-art methods also address dynamic range adaptation through domain transformations, such as group-wise Hadamard rotations (Huang et al., 3 Dec 2025), frequency-domain smoothing (Tao et al., 2021), and block floating-point exponent sharing (Gennari et al., 2019).

2. Algorithmic and Architectural Implementations

Quantization-Aware Training (QAT) and Inference

In QAT, 4-bit quantization is integrated into the training loop. Rounding uses the straight-through estimator (STE) for gradient propagation, and forward matmuls and convolutions either run as true integer arithmetic when hardware permits or are exactly emulated in float32. By training from scratch with QAT and minimizing additional quantization-specific overhead, models can achieve float-equivalent accuracy with 5–6× size reduction (Ding et al., 2022).
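
A minimal PyTorch sketch of the fake-quantization-with-STE pattern this describes; it is an illustrative skeleton, not the exact recipe of Ding et al. (2022):

```python
import torch

class FakeQuant4bit(torch.autograd.Function):
    """Forward: symmetric 4-bit quantize-dequantize. Backward: straight-through estimator."""
    @staticmethod
    def forward(ctx, x):
        qmax = 7                                        # 2^(4-1) - 1
        s = x.abs().max().clamp(min=1e-12) / qmax       # per-tensor scale
        return s * torch.clamp(torch.round(x / s), -8, qmax)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                              # STE: pass gradients through unchanged

class QATLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized on every forward pass."""
    def forward(self, x):
        w_q = FakeQuant4bit.apply(self.weight)
        return torch.nn.functional.linear(x, w_q, self.bias)
```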

Plug-and-Play 4-bit Reparameterizations

For PTQ and efficient transfer to low-bit inference, ConvLinear4bit modules may replace standard linear or convolutional operators as drop-in alternatives. For example, ConvRot (Huang et al., 3 Dec 2025) applies a group-wise Regular Hadamard Transform (RHT) to smooth outliers, followed by uniform 4-bit quantization, INT4 GEMM, and inverse RHT. This eliminates retraining or calibration requirements and is compatible with diffusion and transformer architectures.
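
A conceptual sketch of this rotate-quantize-multiply flow, assuming per-tensor scales and a full (ungrouped) Hadamard transform; ConvRot's group-wise RHT, scaling details, and INT4 kernels are not reproduced here:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)                               # orthonormal, so H @ H.T = I

def _quant(t, k=4):
    qmax = 2 ** (k - 1) - 1
    s = max(np.abs(t).max(), 1e-12) / qmax
    return np.clip(np.round(t / s), -(qmax + 1), qmax).astype(np.int8), s

def rotate_quantize_matmul(x, w, k=4):
    """Rotate to smooth outliers, quantize to 4 bits, multiply in integers, rescale."""
    H = hadamard(x.shape[-1])
    x_r, w_r = x @ H, w @ H                             # rotation along the shared inner dimension
    (qx, sx), (qw, sw) = _quant(x_r, k), _quant(w_r, k)
    y_int = qx.astype(np.int32) @ qw.astype(np.int32).T # stands in for an INT4 GEMM kernel
    # Rotations on the inner dimension cancel analytically, (xH)(wH)^T = x w^T;
    # a deployed pipeline may instead apply an explicit inverse RHT.
    return (sx * sw) * y_int
```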

Residual Knowledge Reclamation

More recent paradigms add low-rank adapter branches to reclaim the quantization error (residual knowledge) lost in n-bit quantization. The CoRa framework (Luo et al., 1 Aug 2024) decomposes the residual between full-precision and quantized weights via SVD, inserting per-layer low-rank corrections. The adapter ranks are searched efficiently on a small calibration set, yielding state-of-the-art accuracy at almost no additional parameter cost and vastly reduced adaptation time.
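
The residual decomposition can be sketched as below; the rank search over a calibration set is omitted, and `quantize_fn` is a placeholder for any 4-bit weight quantizer (e.g., the symmetric one in Section 1):

```python
import numpy as np

def lowrank_residual_adapter(w, quantize_fn, rank):
    """Approximate the quantization residual W - Q(W) with a rank-r factorization A @ B."""
    q_codes, s = quantize_fn(w)                          # 4-bit codes plus scale
    w_hat = s * q_codes.astype(np.float32)               # dequantized weight
    residual = w - w_hat                                 # knowledge lost to quantization
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    A = U[:, :rank] * S[:rank]                           # [out, r]
    B = Vt[:rank, :]                                     # [r, in]
    # Inference: y = x @ w_hat.T + (x @ B.T) @ A.T, i.e. quantized layer plus low-rank correction.
    return q_codes, s, A, B
```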

3. Hardware Adaptations and Microarchitectural Kernels

ConvLinear4bit has motivated specialized hardware and micro-kernel designs to fully exploit the computational and memory advantages of 4-bit data paths.

  • Generic CPUs/GPUs: NEON- and AVX2-supported 4-bit bit-serial GEMM/conv kernels use bit-packing, bit-plane decomposition or bit-serial microkernels, and vectorized popcount-based MACs (Tulloch et al., 2017, Trusov et al., 2020); see the bit-plane sketch after this list.
  • HiKonv Bit-Slice Multipliers: Polynomial packing of p=q=4-bit operands into 32- or 27-bit registers enables one integer multiply to produce up to 13 parallel MACs (CPU ALU) or 8 MACs (FPGA DSP) per cycle (Chen et al., 2022, Liu et al., 2021).
  • Custom Processors: RISC-V vector extensions (Sparq) introduce fused vmacsr instructions for sub-byte multiply–shift–accumulate, enabling 1.7×–3.2× speedup over 16-bit kernels (Dupuis et al., 2023).
  • Tensor Cores: Modern GPUs (e.g., NVIDIA A100) provide INT4 MMA instructions; proper packing, permutation, and mixed-precision scheduling (e.g., COMET’s W4Ax kernel) achieve up to 2.88× kernel–level speedups for LLMs and vision transformers (Liu et al., 16 Oct 2024).
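
To illustrate the bit-plane/popcount idea referenced in the first bullet, the following sketch computes an unsigned 4-bit dot product using only bitwise ANDs and popcounts, which is the primitive those vectorized kernels accelerate (a conceptual illustration, not a production microkernel):

```python
def bitplane_dot(a, b, k=4):
    """Dot product of two equal-length sequences of unsigned k-bit ints via bit-plane popcounts."""
    def planes(v):
        # Pack bit i of every element into one integer word, one bit per element position.
        return [sum(((x >> i) & 1) << pos for pos, x in enumerate(v)) for i in range(k)]
    pa, pb = planes(a), planes(b)
    acc = 0
    for i in range(k):
        for j in range(k):
            # popcount(a_i AND b_j) counts positions where both operand bits are set
            acc += (1 << (i + j)) * bin(pa[i] & pb[j]).count("1")
    return acc

# Sanity check against the ordinary dot product
a, b = [3, 7, 0, 15], [1, 2, 9, 4]
assert bitplane_dot(a, b) == sum(x * y for x, y in zip(a, b))
```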

Table 1: Core Packing and Throughput Characteristics

Platform | Packing & Kernel | Throughput Achieved
ARM NEON/Cortex (CPU) | Bit-serial or 4-bit slices | 4–20× speedup vs float32
Xilinx FPGA DSP48E2 | 3×2 slices, 11-bit guard | 8 MACs/cycle/DSP, 2.6× baseline
NVIDIA GPU (A100) | Block-wise INT4 MMA | 2.88× kernel speedup (Liu et al., 16 Oct 2024)

4. Representative Quantization Recipes and Schemes

Blockwise and Groupwise Strategies

To minimize quantization noise and preserve activation fidelity, blockwise schemes (DSConv, COMET) quantize tensors in small groups or blocks, sometimes allocating higher bitwidth (e.g., 8-bit) to outlier blocks identified via magnitude thresholds. These blocks are then permuted for optimal compute alignment, and quantization/dequantization becomes block-local (Liu et al., 16 Oct 2024, Gennari et al., 2019).
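
A simplified sketch of blockwise quantization with magnitude-based promotion of outlier blocks to 8 bits; the block size, threshold rule, and permutation step are illustrative rather than COMET's or DSConv's exact settings:

```python
import numpy as np

def blockwise_quantize(x, block=128, outlier_thresh=6.0):
    """Quantize 1-D x in blocks: 4-bit by default, 8-bit for blocks with unusually large magnitude."""
    mean_abs = max(np.abs(x).mean(), 1e-12)
    out = []
    for start in range(0, len(x), block):
        blk = x[start:start + block]
        k = 8 if np.abs(blk).max() > outlier_thresh * mean_abs else 4   # promote outlier blocks
        qmax = 2 ** (k - 1) - 1
        s = max(np.abs(blk).max(), 1e-12) / qmax
        q = np.clip(np.round(blk / s), -(qmax + 1), qmax).astype(np.int8)
        out.append((q, s, k))                        # each block carries its own scale and bitwidth
    return out
```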

Rotation/Domain-Smoothing Approaches

Groupwise rotations (e.g., via Hadamard or DFT) significantly reduce the prevalence of outlier channels, enabling plug-and-play W4A4 quantization with minimal visual or top-1 accuracy loss, as demonstrated in diffusion transformers and CNNs (Huang et al., 3 Dec 2025, Tao et al., 2021).

Residual Adapters for Error Correction

When quantization error dominates and simple quantizers saturate, insertion of SVD-based low-rank adapters for each layer (as in CoRa) allows the n-bit ConvLinear layer to approach or match full-precision model accuracy with <1.25% memory overhead and 100–1000× reduction in adaptation iteration count versus prior art (Luo et al., 1 Aug 2024).

5. Empirical Evaluations and Real-World Results

Large-scale evaluations across modalities have established the viability and efficiency of ConvLinear4bit:

  • ASR (LibriSpeech, 118M Conformer-Large): 4-bit QAT achieves 2.0/4.4 WER (dev/test), matching float32; model size reduced from 475 MB to 82 MB (5.8×) (Ding et al., 2022).
  • ImageNet (ResNet-18, MobileNet-V2): 4/4 uniform (+FAT) yields 70.5%/69.2% top-1, 7.7×/6.7× smaller than float (Tao et al., 2021). DSConv with block FP activations obtains <1.3% drop from FP, with 8× weight compression (Gennari et al., 2019).
  • Diffusion Transformers: ConvRot W4A4 yields 2.26× speedup, 4.05× memory reduction, FID change <+0.25 (Huang et al., 3 Dec 2025).
  • LLMs (LLaMA-70B, A100 GPU): COMET W4A4/W4A8 achieves up to 2.88× kernel and 2.02× end-to-end speedup over best FP16 baselines; >80% activations are in A4 (Liu et al., 16 Oct 2024).
  • Hardware: Bitwise/bit-slice designs yield 3.1–3.2× latency reductions (32-bit CPU), 2.37–2.61× DSP efficiency improvements (FPGA), 1.7×–3.2× speedups over 16-bit (Sparq RISC-V) (Tulloch et al., 2017, Chen et al., 2022, Liu et al., 2021, Dupuis et al., 2023).

6. Practical Considerations, Limitations, and Deployment

ConvLinear4bit deployment requires careful selection of kernel/data layout, packing/unpacking strategies, and mixed-precision fallbacks:

  • Packing: two 4-bit values per byte; im2col, channel permutation, NEON/SIMD alignment (a minimal packing sketch follows this list).
  • Quantizer Granularity: Per-tensor/per-channel/group, with outlier channel detection in mixed W4A8.
  • Integration: Fused quant–packing, memory-aligned buffers, minimal extra compute for adaptors or dequantization.
  • Limitations: Full W4A4 quantization does not always match float accuracy (notably for some small ASR models and ResNets), especially when activation outliers are not handled (Ding et al., 2022, Liu et al., 16 Oct 2024). Some schemes (e.g., CoRa) currently quantize only weights, and FAT/rotation schemes require extra pre/post transforms at deployment if retraining is not possible.
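
As a concrete illustration of the two-values-per-byte packing mentioned in the first bullet (nibble order and signedness conventions differ between kernels, so this layout is one plausible choice):

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit codes in [-8, 7] two per byte: even indices in the low nibble."""
    q = q.astype(np.uint8) & 0xF                     # two's-complement nibbles
    if len(q) % 2:
        q = np.append(q, 0)                          # pad to an even count
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed, n):
    """Recover n signed 4-bit codes from the packed byte stream."""
    lo = packed & 0xF
    hi = (packed >> 4) & 0xF
    q = np.empty(len(packed) * 2, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    q = (q ^ 8) - 8                                  # sign-extend 4-bit two's complement
    return q[:n]
```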

7. Impact and Future Directions

The ConvLinear4bit approach consolidates several advances in low-bitwidth deep learning:

  • It underpins practical, large-scale deployment on highly resource-constrained edge and cloud platforms.
  • Extensions—such as plug-and-play domain smoothing, block/adaptive quantization, and residual correction—enable quantization to 4 bits with negligible loss in diverse architectures, including CNNs, transformers, diffusion models, and LLMs.
  • Hardware co-design (ISA extensions, bitwise multipliers, custom tensor-core scheduling) unlocks the true computational and energy efficiency potential.
  • Ongoing work targets activation quantization stability, better outlier management, and seamless integration of low-rank adapters for transformers and LLMs (Luo et al., 1 Aug 2024, Huang et al., 3 Dec 2025, Liu et al., 16 Oct 2024).

In summary, ConvLinear4bit provides the foundational operator class for the next generation of efficient, low-latency, low-footprint deep network inference, integrating advances in quantization theory, algorithmic smoothing, distribution matching, and hardware efficiency across a wide variety of neural architectures and deployment substrates.
