
Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors (2407.12075v1)

Published 16 Jul 2024 in cs.LG and cs.AI

Abstract: Binary Neural Networks (BNNs) enable efficient deep learning by saving on storage and computational costs. However, as the size of neural networks continues to grow, meeting computational requirements remains a challenge. In this work, we propose a new form of quantization to tile neural network layers with sequences of bits to achieve sub-bit compression of binary-weighted neural networks. The method learns binary vectors (i.e. tiles) to populate each layer of a model via aggregation and reshaping operations. During inference, the method reuses a single tile per layer to represent the full tensor. We employ the approach to both fully-connected and convolutional layers, which make up the breadth of space in most neural architectures. Empirically, the approach achieves near full-precision performance on a diverse range of architectures (CNNs, Transformers, MLPs) and tasks (classification, segmentation, and time series forecasting) with up to an 8x reduction in size compared to binary-weighted models. We provide two implementations for Tiled Bit Networks: 1) we deploy the model to a microcontroller to assess its feasibility in resource-constrained environments, and 2) a GPU-compatible inference kernel to facilitate the reuse of a single tile per layer in memory.


Introduction

This paper presents a novel approach named Tiled Bit Networks (TBNs) aimed at achieving sub-bit neural network compression. TBNs learn binary sequences, referred to as tiles, which populate the layers of a neural network model through aggregation and reshaping operations. During inference, each layer of the neural network reuses a single tile, allowing a significant reduction in memory and storage requirements. The method demonstrates efficiency across a variety of neural architectures, including CNNs, Transformers, and MLPs, and performs well on tasks such as image classification, segmentation, and time series forecasting.

Methodology

Layer-Wise Tiling

TBNs utilize a unique approach to quantization by learning binary tiles to fill the weights of a neural network model during training. This involves aggregating and reshaping the model's parameters into condensed binary vectors that serve as reusable tiles. Specifically, the method begins with standard full-precision weights. These weights are reshaped and aggregated into a condensed set of values, which are then binarized to form the tiles.

The binary tiles are replicated to match the required dimensionality of the neural layers, thereby creating a sub-bit representation of the original parameters. The process proceeds as follows, with a code sketch after the list:

  1. Reshape the weight tensor into a p × q matrix.
  2. Aggregate the matrix to form a vector.
  3. Apply a binary threshold to create the tiles.
  4. Replicate the tiles to form the final binary tensor.
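
A minimal PyTorch sketch of these four steps follows. The choice of mean aggregation, the sign threshold, and the zero-padding for shapes that do not divide evenly into tiles are illustrative assumptions, not necessarily the paper's exact operators:

```python
import torch

def make_tile(weight: torch.Tensor, tile_len: int) -> torch.Tensor:
    # 1. Flatten and zero-pad the full-precision weights so they reshape
    #    into a p x q matrix with q = tile_len (padding is an assumption
    #    for shapes that do not divide evenly).
    w = weight.flatten()
    pad = (-w.numel()) % tile_len
    w = torch.cat([w, w.new_zeros(pad)]).reshape(-1, tile_len)
    # 2. Aggregate the p rows into a single length-q vector (mean used here).
    agg = w.mean(dim=0)
    # 3. Binarize with a sign threshold to obtain the tile.
    return torch.where(agg >= 0, torch.ones_like(agg), -torch.ones_like(agg))

def expand_tile(tile: torch.Tensor, shape: tuple) -> torch.Tensor:
    # 4. Replicate the tile to cover the layer's full weight tensor.
    n = 1
    for d in shape:
        n *= d
    reps = -(-n // tile.numel())  # ceil division
    return tile.repeat(reps)[:n].reshape(shape)
```

During training, gradients would typically flow through the sign threshold to the underlying full-precision weights via a straight-through estimator, as is standard practice for binary networks.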

Tile-Wise Scalars

The performance of TBNs is further improved by applying scaling factors, α, to the binary tiles. Two primary methods are considered for calculating α: one uses the original weight tensor, and the other uses a separate parameter designed exclusively for this calculation. Additionally, α can be computed globally for an entire layer or locally for each tile within the layer.
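
A sketch of the global-versus-local distinction, assuming the familiar mean-absolute-value scaling from binary network practice (e.g., XNOR-Net) rather than the paper's exact formula:

```python
import torch

def compute_alpha(weight: torch.Tensor, tile_len: int,
                  per_tile: bool = True) -> torch.Tensor:
    w = weight.flatten()
    if not per_tile:
        # Global: a single scalar alpha for the entire layer.
        return w.abs().mean()
    # Local: one alpha per tile-sized slice of the weight tensor
    # (zero-padding for non-divisible shapes is an assumption).
    pad = (-w.numel()) % tile_len
    w = torch.cat([w, w.new_zeros(pad)]).reshape(-1, tile_len)
    return w.abs().mean(dim=1)
```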

Experimental Results

CNN Architectures

Experiments on CIFAR-10 and ImageNet illustrate the effectiveness of TBNs compared to existing sub-bit compression techniques such as SNN, MST, and Spark. TBNs achieve strong performance across various compression rates, often matching or exceeding the performance of binary-weight neural networks (BWNNs). Notably, TBNs can achieve up to 8x compression with negligible loss in accuracy for several architectures.

MLP-Based Architectures

The PointNet model, which is heavily reliant on fully-connected layers, demonstrates that TBNs can effectively compress MLPs. For classification, part segmentation, and semantic segmentation tasks, TBNs achieve performance close to full precision models, often surpassing existing binary models.

Transformers

Experiments on Vision Transformers and Time Series Transformers highlight that TBNs maintain high accuracy even under significant compression. Transformers, known for their reliance on fully-connected layers, benefit notably from TBNs' ability to achieve sub-bit compression without substantial loss in performance.

Practical Implementations

Microcontroller Deployment

A microcontroller implementation of TBNs demonstrates practical applicability in resource-constrained environments. Compared to BWNNs, TBNs provide a significant reduction in memory and storage requirements while maintaining similar inference speed.
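
One reason the footprint shrinks is that a single binary tile can be bit-packed so each stored weight costs one bit of flash. The packing scheme below is a generic illustration, not the paper's documented firmware layout:

```python
import numpy as np

def pack_tile(tile: np.ndarray) -> bytes:
    # Map {-1, +1} entries to {0, 1} and pack eight per byte.
    bits = (tile > 0).astype(np.uint8)
    return np.packbits(bits).tobytes()

def unpack_tile(blob: bytes, length: int) -> np.ndarray:
    # Recover the {-1, +1} tile from its packed form.
    bits = np.unpackbits(np.frombuffer(blob, dtype=np.uint8))[:length]
    return bits.astype(np.int8) * 2 - 1
```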

GPU Inference Kernel

The TBN GPU kernel, implemented using the Triton library, allows for efficient inference with significant memory savings. For instance, the ImageNet ViT model sees a 2.8x reduction in peak memory usage when utilizing the TBN kernel. This illustrates the feasibility of deploying TBNs in high-performance computing environments, thereby extending their applicability.
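
The kernel itself is beyond a short excerpt, but a PyTorch-level sketch conveys the idea of holding only one tile per layer in memory and expanding it at call time. Note that the paper's Triton kernel fuses this expansion into the matrix multiply rather than materializing the full weight as done here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiledLinear(nn.Module):
    # Illustrative inference-only layer: stores a length-q binary tile and a
    # scale alpha instead of a full out_features x in_features weight matrix.
    def __init__(self, tile: torch.Tensor, alpha: float,
                 in_features: int, out_features: int):
        super().__init__()
        self.register_buffer("tile", tile)  # +/-1 entries
        self.alpha = alpha                  # a global per-layer scale assumed
        self.in_features = in_features
        self.out_features = out_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.out_features * self.in_features
        reps = -(-n // self.tile.numel())   # ceil division
        w = self.tile.repeat(reps)[:n].view(self.out_features, self.in_features)
        return F.linear(x, self.alpha * w)
```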

Ablation Studies

Ablation studies reveal key insights into the impact of various hyperparameters on the performance of TBNs. Limiting tiling to layers above a certain size threshold, λ, proves crucial for maintaining model performance. Additionally, using a separate parameter for computing the α scales and optimizing α on a per-tile basis yield marginal performance gains.
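
A sketch of how such a size threshold might gate tiling when building the model; the attribute check and the fallback to ordinary weights are assumptions for illustration:

```python
import torch.nn as nn

def tile_eligible(module: nn.Module, lam: int) -> bool:
    # Tile only layers whose weight tensor has at least lambda parameters;
    # smaller layers would keep their ordinary (e.g., binary) weights.
    w = getattr(module, "weight", None)
    return w is not None and w.numel() >= lam
```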

Implications and Future Work

The paper confirms that TBNs offer a versatile and effective method for compressing neural networks to sub-bit levels, broadening the potential for deploying AI models in constrained environments. Moving forward, the application of TBNs in contexts with both binary weights and activations presents an interesting avenue for research. Further areas of exploration include scaling the approach to LLMs and developing specialized convolutional kernels to fully harness the potential of TBNs.

In conclusion, TBNs represent a promising method for neural network compression, achieving sub-bit efficiency while preserving performance across a range of architectures and tasks.

Authors (3)
  1. Matt Gorbett
  2. Hossein Shirazi
  3. Indrakshi Ray