Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors
Introduction
This paper presents Tiled Bit Networks (TBNs), an approach to sub-bit neural network compression. TBNs learn binary sequences, referred to as tiles, which populate the layers of a neural network through aggregation and reshaping operations. During inference, each layer reuses a single tile, yielding a significant reduction in memory and storage requirements. The method is effective across a variety of architectures, including CNNs, Transformers, and MLPs, and performs well on tasks such as image classification, segmentation, and time series forecasting.
Methodology
Layer-Wise Tiling
TBNs take a distinctive approach to quantization: during training, they learn binary tiles that fill the weights of the model. The model's parameters are reshaped and aggregated into condensed vectors that serve as reusable tiles. Specifically, the method begins with standard full-precision weights, which are reshaped and aggregated into a compact real-valued vector; this vector is then binarized to form the tile.
The binary tiles are replicated to match the required dimensionality of the neural layers, thereby creating a sub-bit representation of the original parameters. The process proceeds as follows (a minimal code sketch appears after the list):
- Reshape the weight tensor into a matrix.
- Aggregate the matrix to form a vector.
- Apply a binary threshold to create the tiles.
- Replicate the tiles to form the final binary tensor.
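The following is a minimal PyTorch-style sketch of these four steps for a single layer. The helper names (`make_tile`, `expand_tile`), the mean-based aggregation, and the sign threshold are illustrative assumptions rather than the paper's exact operators.

```python
import math
import torch

def make_tile(weight: torch.Tensor, tile_len: int) -> torch.Tensor:
    """Illustrative tiling: reshape -> aggregate -> binarize (steps 1-3 above)."""
    w = weight.reshape(-1)                                   # flatten the weight tensor
    n_rows = w.numel() // tile_len
    mat = w[: n_rows * tile_len].reshape(n_rows, tile_len)   # step 1: matrix
    agg = mat.mean(dim=0)                                    # step 2: aggregate rows to a vector
    tile = torch.sign(agg)                                   # step 3: binarize to {-1, +1}
    tile[tile == 0] = 1.0
    return tile

def expand_tile(tile: torch.Tensor, shape: torch.Size) -> torch.Tensor:
    """Step 4: replicate the tile to fill the layer's original shape."""
    numel = math.prod(shape)
    reps = -(-numel // tile.numel())                         # ceil division
    return tile.repeat(reps)[:numel].reshape(shape)

# Example: a 256x512 linear layer represented by a single 512-entry binary tile.
w = torch.randn(256, 512)
tile = make_tile(w, tile_len=512)
w_binary = expand_tile(tile, w.shape)                        # same shape, values in {-1, +1}
```

During training, a straight-through estimator would typically let gradients flow through the binarization step; that detail is omitted from the sketch.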
Tile-Wise Scalars
The performance of TBNs is further improved by applying scaling factors, s, to the binary tiles. Two primary methods are considered for calculating s: one uses the original weight tensor, and the other uses a separate parameter designed exclusively for this calculation. Additionally, s can be computed globally for an entire layer or locally for each tile within the layer.
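A brief sketch of the two granularities, assuming the option that derives s from the original weight tensor via a mean of absolute values (the exact formula is an assumption here); the alternative option simply keeps s as its own learnable parameter.

```python
import torch

def scales_from_weight(weight: torch.Tensor, tile_len: int, per_tile: bool) -> torch.Tensor:
    """Derive s from the original weight tensor (first option above).
    The mean-of-absolute-values form is an assumed, illustrative choice."""
    if not per_tile:
        return weight.abs().mean()                           # global: one scalar per layer
    w = weight.reshape(-1)
    n_tiles = w.numel() // tile_len
    mat = w[: n_tiles * tile_len].reshape(n_tiles, tile_len)
    return mat.abs().mean(dim=1)                             # local: one scalar per tile replica

# Second option: keep s as its own learnable parameter instead.
s_global = torch.nn.Parameter(torch.ones(()))                # per-layer scale
s_local = torch.nn.Parameter(torch.ones(16))                 # per-tile scales (e.g., 16 replicas)
```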
Experimental Results
CNN Architectures
Experiments on CIFAR-10 and ImageNet illustrate the effectiveness of TBNs compared to existing sub-bit compression techniques such as SNN, MST, and Spark. TBNs achieve strong performance across various compression rates, often matching or exceeding the performance of binary-weight neural networks (BWNNs). Notably, TBNs can achieve up to 8x compression with negligible loss in accuracy for several architectures.
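As a purely illustrative back-of-the-envelope calculation (the layer size and reuse factor below are assumptions, not figures from the paper), reusing one tile k times within a layer stores roughly 1/k bits per parameter, which is where the sub-bit compression comes from:

```python
# Back-of-the-envelope storage for one layer with N parameters, where a single
# binary tile of length N / k is reused k times (N and k are assumptions here;
# per-tile scalars and other overheads are ignored).
N, k = 1 << 20, 8
fp32_bits   = 32 * N                 # full-precision storage
binary_bits = 1 * N                  # 1-bit binary-weight storage
tbn_bits    = N // k                 # only the tile itself is stored
print(tbn_bits / N)                  # 0.125 bits per parameter (sub-bit)
print(binary_bits / tbn_bits)        # 8x smaller than a binary-weight layer
print(fp32_bits / tbn_bits)          # 256x smaller than fp32
```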
MLP-Based Architectures
Experiments with PointNet, a model that relies heavily on fully-connected layers, show that TBNs can effectively compress MLP-based architectures. For classification, part segmentation, and semantic segmentation tasks, TBNs approach the accuracy of full-precision models and often surpass existing binary models.
Transformers
Experiments on Vision Transformers and Time Series Transformers highlight that TBNs maintain high accuracy even under significant compression. Transformers, known for their reliance on fully-connected layers, benefit notably from TBNs' ability to achieve sub-bit compression without substantial loss in performance.
Practical Implementations
Microcontroller Deployment
A microcontroller implementation of TBNs demonstrates practical applicability in resource-constrained environments. Compared to BWNNs, TBNs provide a significant reduction in memory and storage requirements while maintaining similar inference speed.
GPU Inference Kernel
The TBN GPU kernel, implemented using the Triton library, allows for efficient inference with significant memory savings. For instance, the ImageNet ViT model sees a 2.8x reduction in peak memory usage when utilizing the TBN kernel. This illustrates the feasibility of deploying TBNs in high-performance computing environments, thereby extending their applicability.
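The authors' kernel is written in Triton; the sketch below is not that kernel but a PyTorch-level illustration of why persistent memory drops: only a bit-packed tile and a scale are stored per layer, and the binary weight is expanded transiently inside forward. The class `TiledLinear` and its details are hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiledLinear(nn.Module):
    """Linear layer that persists only a bit-packed tile and one scale.

    The full binary weight is materialized transiently inside forward and
    released afterwards, so persistent storage is roughly tile_len bits rather
    than out_features * in_features * 32 bits. A PyTorch-level sketch of the
    memory-saving idea, not the paper's Triton kernel.
    """

    def __init__(self, in_features, out_features, tile, scale):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.tile_len = tile.numel()
        bits = (tile > 0).to(torch.uint8).numpy()
        self.register_buffer("packed", torch.from_numpy(np.packbits(bits)))
        self.register_buffer("scale", torch.tensor(float(scale)))

    def forward(self, x):
        bits = np.unpackbits(self.packed.cpu().numpy())[: self.tile_len]
        tile = torch.from_numpy(bits).to(x.dtype).to(x.device) * 2 - 1   # back to {-1, +1}
        numel = self.out_features * self.in_features
        reps = -(-numel // self.tile_len)                                # ceil division
        w = tile.repeat(reps)[:numel].reshape(self.out_features, self.in_features)
        return F.linear(x, self.scale * w)

# Usage: replace a 512x512 nn.Linear with a layer backed by a 512-bit tile.
tile = torch.sign(torch.randn(512))
tile[tile == 0] = 1
layer = TiledLinear(512, 512, tile, scale=0.02)
y = layer(torch.randn(4, 512))
```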
Ablation Studies
Ablation studies reveal key insights into the impact of various hyperparameters on the performance of TBNs. Limiting tiling to layers above a certain size threshold proves crucial for maintaining model performance. Additionally, computing the scaling factors from a separate parameter and optimizing them on a per-tile basis yield marginal additional gains.
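As a small illustration of the size-threshold ablation, the helper below tiles a layer only when its parameter count exceeds a threshold; the name `should_tile` and the default value are hypothetical placeholders, not the paper's setting.

```python
import torch.nn as nn

def should_tile(layer: nn.Module, min_params: int = 10_000) -> bool:
    """Tile only layers whose weight tensor is large enough; smaller layers
    keep their original representation. The default threshold is an
    arbitrary placeholder, not the paper's value."""
    weight = getattr(layer, "weight", None)
    return weight is not None and weight.numel() >= min_params

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
tiled = [name for name, m in model.named_modules() if should_tile(m)]
print(tiled)   # only the first linear layer crosses the threshold
```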
Implications and Future Work
The paper confirms that TBNs offer a versatile and effective method for compressing neural networks to sub-bit levels, broadening the potential for deploying AI models in constrained environments. Moving forward, the application of TBNs in contexts with both binary weights and activations presents an interesting avenue for research. Further areas of exploration include scaling the approach to LLMs and developing specialized convolutional kernels to fully harness the potential of TBNs.
In conclusion, TBNs represent a promising method for neural network compression, achieving sub-bit efficiency while preserving performance across a range of architectures and tasks.