Improving Neural Network Quantization without Retraining using Outlier Channel Splitting (1901.09504v3)

Published 28 Jan 2019 in cs.LG and stat.ML

Abstract: Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the less-studied topic of quantizing a floating-point model without (re)training. DNN weights and activations follow a bell-shaped distribution post-training, while practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior work has addressed this by clipping the outliers or using specialized hardware. In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the channel values. The network remains functionally identical, but affected outliers are moved toward the center of the distribution. OCS requires no additional training and works on commodity hardware. Experimental evaluation on ImageNet classification and language modeling shows that OCS can outperform state-of-the-art clipping techniques with only minor overhead.

Improving Neural Network Quantization Using Outlier Channel Splitting: A Formal Overview

The paper "Improving Neural Network Quantization without Retraining using Outlier Channel Splitting" presents an advanced exploration of post-training quantization techniques for deep neural networks (DNNs). This work targets the challenge of quantizing floating-point models without retraining, a practical scenario often faced by ML service providers who operate black-box client models and lack access to the training data.

Problem Statement and Methodology

Quantization aims to reduce the execution latency and energy costs of DNNs by converting floating-point weights and activations to low-precision representations. The authors tackle the issue of outliers in DNN weight distributions: because practical hardware uses a linear quantization grid, a few large-magnitude values stretch the grid, coarsen the resolution available to the bulk of the distribution, and increase mean squared quantization error (MSE). Unlike conventional approaches that rely on clipping these outliers, the paper introduces the Outlier Channel Splitting (OCS) technique.
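To make the outlier issue concrete, here is a minimal sketch (not from the paper's code) of symmetric linear-grid quantization with an optional clipping threshold; the `linear_quantize` helper, the 6-bit setting, the injected outlier, and the clip value are illustrative assumptions.

```python
import numpy as np

def linear_quantize(w, num_bits=6, clip=None):
    """Symmetric uniform (linear-grid) quantization sketch.

    If `clip` is given, the range is capped before choosing the step size
    and any |w| above the threshold saturates. Without clipping, a single
    outlier stretches the grid and coarsens every other value.
    """
    max_val = float(np.abs(w).max()) if clip is None else float(clip)
    levels = 2 ** (num_bits - 1) - 1
    step = max_val / levels                      # grid spacing
    q = np.clip(np.round(w / step), -levels, levels)
    return q * step                              # de-quantized values

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=10_000)
w[0] = 1.0                                       # one large outlier weight

mse_full = np.mean((w - linear_quantize(w, 6)) ** 2)
mse_clip = np.mean((w - linear_quantize(w, 6, clip=0.2)) ** 2)
print(mse_full, mse_clip)   # clipping the outlier lowers overall MSE here
```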

OCS duplicates the channels containing outliers and halves their values, yielding a functionally identical network in which the affected outliers are moved toward the center of the distribution. This modification requires no retraining and runs on commodity hardware.
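The following is a minimal NumPy sketch of the naive splitting step on a single linear layer; the `naive_channel_split` helper, the layer shapes, and the injected outlier are illustrative assumptions, not the paper's implementation (which also handles convolutions and uses a quantization-aware splitting rule).

```python
import numpy as np

def naive_channel_split(W, x, num_split=1):
    """Naive OCS sketch for a single linear layer y = W @ x.

    Duplicates the input channel of W holding the largest-magnitude weight,
    halves both copies, and duplicates the matching entry of x (in a full
    network this means duplicating the output channel of the preceding
    layer). The layer output is unchanged, but the outlier magnitude is
    halved.
    """
    W, x = W.copy(), x.copy()
    for _ in range(num_split):
        # Input channel containing the current largest-magnitude weight.
        c = int(np.argmax(np.max(np.abs(W), axis=0)))
        col = W[:, c] / 2.0
        W[:, c] = col
        W = np.concatenate([W, col[:, None]], axis=1)   # duplicate + halve
        x = np.concatenate([x, x[c:c + 1]])             # duplicate the input
    return W, x

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W[2, 5] = 4.0                      # inject an outlier weight
x = rng.normal(size=8)

W_split, x_split = naive_channel_split(W, x, num_split=1)
print(np.allclose(W @ x, W_split @ x_split))              # True: same output
print(np.abs(W[:, 5]).max(), np.abs(W_split[:, 5]).max()) # 4.0 -> 2.0
```

The duplicated channel adds one column of weights and one activation, which is the "minor overhead" the abstract refers to: the expansion ratio controls how many outlier channels are split.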

Empirical Evaluation and Insights

The evaluation spans ImageNet convolutional neural networks (CNNs) such as VGG16, ResNet-50, DenseNet-121, and Inception-V3, as well as RNNs for language modeling. The results are compelling:

  • OCS surpasses state-of-the-art clipping methods when weights are quantized to low bitwidths, retaining accuracy with negligible memory overhead.
  • At 5 and 6 bits, OCS with modest expansion ratios outperformed clipping by margins of up to 13% for some models.
  • Combining OCS with clipping — particularly at very low precision — yields superior results compared to individual methods.
  • Activation quantization using OCS was less effective, likely due to profiling inaccuracies, but an oracle version demonstrated potential.

Technical Distinctions

The methodology offers notable improvements over existing practices by addressing outlier challenges without requiring specialized hardware. Its quantization-aware splitting rule avoids the extra rounding error that naive halving can introduce, so the split network remains a faithful low-precision representation of the original.
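As a rough illustration of why the splitting rule matters, the sketch below shows a grid-aligned weight whose naive halves both round the same way, and a split whose two halves are nudged by ±Δ/4 so their rounding errors cancel; the ±Δ/4 shift is an illustrative choice and may differ from the paper's exact rule.

```python
import numpy as np

def quantize(w, step):
    """Round to the nearest point on a linear grid with spacing `step`."""
    return np.round(w / step) * step

step = 1.0 / 31            # e.g. a 6-bit symmetric grid with max value 1.0
w = 5 * step               # an outlier that already sits on the grid

# Naive split: both halves land midway between grid points, round the same
# way, and their rounding errors add instead of cancelling.
naive = quantize(w / 2, step) + quantize(w / 2, step)

# Quantization-aware split (sketch): shift the halves by -step/4 and +step/4
# so one rounds down and the other rounds up; the errors cancel.
aware = quantize(w / 2 - step / 4, step) + quantize(w / 2 + step / 4, step)

print(abs(naive - w), abs(aware - w))   # naive error = step, aware error = 0
```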

Implications and Future Directions

The proposed OCS method significantly enhances the scope of neural network deployment in real-world applications where retraining is impractical. It aligns well with contemporary commercial systems like NVIDIA’s TensorRT, suggesting broad applicability.

Future research directions may involve refining channel selection strategies and integrating OCS into training processes. Such refinements could shape weight distributions that are friendlier to quantization, potentially improving accuracy under quantized operation. Moreover, extending the approach with dynamic channel-selection techniques for activations could improve OCS's utility in that setting.

This paper represents an incremental yet substantial advancement in the optimization of post-training quantization, potentially influencing both academic research and industry practices.

Authors (5)
  1. Ritchie Zhao
  2. Yuwei Hu
  3. Jordan Dotzel
  4. Christopher De Sa
  5. Zhiru Zhang
Citations (291)