
FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer (2111.13824v4)

Published 27 Nov 2021 in cs.CV

Abstract: Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization methods have been developed mainly on Convolutional Neural Networks (CNNs), and suffer severe degradation when applied to fully quantized vision transformers. In this work, we demonstrate that many of these difficulties arise because of serious inter-channel variation in LayerNorm inputs, and present Power-of-Two Factor (PTF), a systematic method to reduce the performance degradation and inference complexity of fully quantized vision transformers. In addition, observing an extreme non-uniform distribution in attention maps, we propose Log-Int-Softmax (LIS) to sustain that and simplify inference by using 4-bit quantization and the BitShift operator. Comprehensive experiments on various transformer-based architectures and benchmarks show that our Fully Quantized Vision Transformer (FQ-ViT) outperforms previous works while even using lower bit-width on attention maps. For instance, we reach 84.89% top-1 accuracy with ViT-L on ImageNet and 50.8 mAP with Cascade Mask R-CNN (Swin-S) on COCO. To our knowledge, we are the first to achieve lossless accuracy degradation (~1%) on fully quantized vision transformers. The code is available at https://github.com/megvii-research/FQ-ViT.

Post-Training Quantization for Vision Transformers: FQ-ViT Insights

The paper introduces FQ-ViT, a post-training quantization approach tailored specifically to Vision Transformers (ViTs). It addresses the challenge of reducing the inference complexity of ViTs, which typically incur substantial computational overhead owing to their larger parameter counts relative to CNN counterparts. The authors propose two primary techniques, the Power-of-Two Factor (PTF) and Log-Int-Softmax (LIS), which enable effective quantization of LayerNorm inputs and attention maps, respectively.

Key Contributions

  1. Revisiting Quantization Obstacles: The research identifies two significant issues in vision transformer quantization: pronounced inter-channel variation in LayerNorm inputs, and an extremely non-uniform distribution of attention map values. Both effects amplify quantization error and hinder deploying ViTs on resource-constrained devices.
  2. Power-of-Two Factor for LayerNorm: PTF alleviates the inter-channel variation by assigning each channel its own power-of-two scaling factor on top of a shared layer-wise scale. This reduces quantization error while keeping computation integer-only via BitShift, preserving the efficiency of layer-wise quantization (a minimal sketch of the idea follows this list).
  3. Log-Int-Softmax for Efficient Attention Map Quantization: LIS handles softmax quantization by combining log2 quantization with a novel integer-only logarithm, enabling a 4-bit scheme for attention maps that preserves competitive accuracy while allowing integer-only inference, with corresponding gains in speed and energy consumption (see the second sketch after this list).
  4. Comprehensive Performance Evaluation: Extensive experiments on ViT, DeiT, and Swin Transformer models across benchmarks such as ImageNet and COCO show minimal accuracy degradation (∼1%) for fully quantized models, outperforming previous post-training quantization methods while reducing hardware overhead through lower bit-width storage and computation.
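
To make the PTF idea concrete, here is a minimal NumPy sketch based on the paper's general description; it is not the authors' implementation, and the factor-selection heuristic plus the names ptf_quantize and ptf_realign are illustrative assumptions. Each channel of the LayerNorm input gets an integer power-of-two factor on top of one shared layer-wise scale, and channels are realigned with a left shift (the BitShift operator) so LayerNorm statistics can be computed on integers.

```python
import numpy as np

def ptf_quantize(x, n_bits=8, K=3):
    """Illustrative PTF-style quantization of LayerNorm inputs (a sketch,
    not the authors' code). x has shape (tokens, channels)."""
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -(2 ** (n_bits - 1))

    # One shared layer-wise scale, sized so that 2**K * s covers the
    # largest-magnitude channel.
    s = np.abs(x).max() / (qmax * 2 ** K)

    # Per-channel power-of-two factor alpha_c in [0, K]: this simple heuristic
    # picks the smallest factor whose scale covers the channel's range.
    ch_max = np.abs(x).max(axis=0)                        # (channels,)
    alpha = np.ceil(np.log2(np.maximum(ch_max / (qmax * s), 1.0)))
    alpha = np.clip(alpha, 0, K).astype(np.int64)

    # Quantize each channel with its own effective scale 2**alpha_c * s.
    x_q = np.clip(np.round(x / (s * 2.0 ** alpha)), qmin, qmax).astype(np.int64)
    return x_q, s, alpha

def ptf_realign(x_q, alpha):
    """Left-shift each channel by its factor so all values share the single
    layer-wise scale s; LayerNorm mean/variance can then run on integers."""
    return x_q << alpha  # equivalent to x_q * 2**alpha, i.e. a BitShift
```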

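A similarly hedged sketch of the LIS flow is shown below. The softmax here is computed in floating point for brevity, whereas the paper replaces it with an integer-only approximation; the function name and shapes are illustrative assumptions. The key point is that log2-quantized attention weights turn the product with the value matrix into right shifts, so 4-bit attention maps pair naturally with integer-only inference.

```python
import numpy as np

def log_int_softmax_sketch(attn_logits, v_int, n_bits=4):
    """Illustrative Log-Int-Softmax (LIS) sketch, not the paper's kernel.

    attn_logits: raw attention scores for one query, shape (tokens,).
    v_int: quantized value matrix (integers), shape (tokens, head_dim).
    """
    # 1. Softmax; FQ-ViT computes this step with an integer-only approximation.
    p = np.exp(attn_logits - attn_logits.max())
    p = p / p.sum()

    # 2. Log2 quantization: each probability is stored as a small exponent,
    #    so 4 bits cover the extremely non-uniform attention distribution.
    qmax = 2 ** n_bits - 1
    q = np.clip(np.round(-np.log2(np.maximum(p, 2.0 ** -qmax))), 0, qmax)
    q = q.astype(np.int64)

    # 3. Since p is approximately 2**(-q), the product p @ V becomes a sum of
    #    right-shifted integer rows of V -- no dequantization or float multiply.
    out = np.zeros(v_int.shape[1], dtype=np.int64)
    for j in range(v_int.shape[0]):
        out += v_int[j] >> q[j]
    return out
```
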
Theoretical and Practical Implications

Theoretically, this work advances the understanding of practical considerations when quantizing attention mechanisms and normalization layers within transformer architectures. The introduction of PTF and LIS offers a framework that maintains competitive model performance while greatly diminishing computational costs, potentially encouraging further research to explore even more aggressive bit-reduction techniques.

Practically, by achieving near-lossless quantization, FQ-ViT holds promise for industrial applications, particularly those bound by stringent resource constraints such as real-time edge computing and mobile implementations. The integer-only frameworks introduced by PTF and LIS can inspire hardware design optimizations for future accelerator architectures to exploit these efficiency gains.

Future Developments

This paper opens several avenues for future exploration. Fully quantizing transformer architectures invites integration with other optimization strategies, such as model pruning or NAS frameworks, to further shrink the model footprint. Moreover, investigating how well these techniques generalize to other transformer architectures, such as those used in NLP tasks, could reveal additional applications and insights.

In conclusion, this paper provides an effective pathway for the quantization of vision transformer models, achieving efficiency without compromising on performance. With the FQ-ViT approach, the deployment of these models in diverse, computation-restricted environments becomes increasingly feasible, paving the way for broader applicability and innovation in the field of AI and computer vision.

Authors (5)
  1. Yang Lin
  2. Tianyu Zhang
  3. Peiqin Sun
  4. Zheng Li
  5. Shuchang Zhou
Citations (112)