CascadeCNN: FPGA Acceleration via Quantisation Cascades
- The paper introduces a two-stage design that leverages a low-precision unit for rapid inference and a high-precision unit for accuracy recovery without retraining.
- It details a dynamic fixed-point quantisation method with per-layer scaling to optimize bitwidths and balance speed with error constraints.
- Empirical results show up to 55% speedup on FPGA platforms while maintaining target CNN accuracy through confidence-guided selective computation.
Quantisation cascades, as realized in CascadeCNN, are a methodology for accelerating CNN inference on FPGAs through the synergistic combination of aggressive quantisation and confidence-driven selective re-computation. CascadeCNN implements a two-stage architecture comprising independently quantised units and an intermediate confidence evaluation, enabling high-throughput execution while preserving target accuracy levels under user-specified constraints. This approach is fully automated, requires no retraining or access to the original training data, and adapts both quantisation bitwidths and hardware parameters to the target FPGA and task requirements (Kouris et al., 2018).
1. Architectural Framework: Two-Stage Quantisation Cascade
CascadeCNN decomposes the inference pipeline into three logically distinct modules:
- Low-Precision Unit (LPU): Executes the entire CNN using a very low bitwidth (typically 4 or 5 bits, uniform across all layers but with per-layer scaling). This stage achieves maximal throughput by packing multiple MACCs into LUTs and exploiting DSP packing (e.g., two 4-bit MACCs per DSP using guard-bit interleaving). However, the accuracy is intentionally degraded to serve as a computational filter (Kouris et al., 2018).
- Confidence Evaluation Unit (CEU): After the LPU produces softmax outputs, the CEU computes a tunable uncertainty metric based on a generalised Best-vs-Second-Best (gBvSB) margin. Inputs deemed “confident” terminate after the LPU, while “hard” samples are forwarded to the high-precision stage.
- High-Precision Unit (HPU): Replicates the LPU topology but with a larger bitwidth (commonly 8 or 16 bits), restoring the accuracy of the original model for those samples that could not be confidently classified by the LPU.
The system dataflow is: inputs → LPU → CEU; outputs that pass the confidence threshold are accepted as final, the rest are recomputed from scratch by the HPU. Final predictions are merged in input order.
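The dataflow above can be sketched as a minimal sketch in NumPy, assuming hypothetical `lpu_forward`/`hpu_forward` stand-ins for the two hardware units and a generic `confidence` metric:

```python
import numpy as np

def cascade_infer(batch, lpu_forward, hpu_forward, confidence, threshold):
    """Two-stage cascade: cheap LPU pass for all inputs, HPU re-computation
    only for low-confidence samples. All callables are illustrative stand-ins
    for the hardware units described in the text."""
    lpu_probs = lpu_forward(batch)                  # (N, classes) softmax outputs
    confident = confidence(lpu_probs) >= threshold  # (N,) boolean mask from the CEU
    preds = lpu_probs.argmax(axis=1)                # LPU predictions, in input order
    hard = batch[~confident]
    if len(hard) > 0:
        hpu_probs = hpu_forward(hard)               # "hard" samples recomputed from scratch
        preds[~confident] = hpu_probs.argmax(axis=1)
    return preds, (~confident).mean()               # merged predictions + HPU fraction r
```

The returned fraction corresponds to the share of inputs routed to the HPU, which the CEU tuning step tries to minimize.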
2. Quantisation Methodology and Search
CascadeCNN employs a dynamic fixed-point quantisation format with uniform wordlength but per-layer scaling to accommodate the varying dynamic ranges of weights and activations. Formal quantisation of a real-valued tensor element $x$ in layer $l$ is given by:

$$\hat{x} = 2^{-s_l} \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(x \cdot 2^{s_l}\right),\; -2^{b-1},\; 2^{b-1}-1\right)$$

where $b$ is the bitwidth and each layer $l$ has its own scaling factors $s_l^{w}$ for weights and $s_l^{a}$ for activations. The quantisation search iteratively sweeps over candidate bitwidths and per-layer scaling, using a small evaluation set (≈200 samples) to empirically determine the accuracy surface. The smallest wordlength that gives acceptable accuracy (according to the user's error budget) is selected for the HPU; for the LPU, an even more aggressive quantisation is chosen, maximizing throughput while still allowing the CEU to recover overall accuracy post hoc. This optimization requires neither network retraining nor access to the original training set (Kouris et al., 2018).
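A minimal sketch of this dynamic fixed-point scheme, with a simple range-driven heuristic for the per-layer fractional length (the toolflow instead sweeps candidates against the evaluation set):

```python
import numpy as np

def dfp_quantise(x, bits, frac_bits):
    """Dynamic fixed-point: uniform wordlength `bits`, per-layer fractional
    length `frac_bits` acting as the scaling factor 2**frac_bits."""
    qmin, qmax = -2**(bits - 1), 2**(bits - 1) - 1
    q = np.clip(np.round(x * 2**frac_bits), qmin, qmax)   # integer codes
    return q * 2.0**(-frac_bits)                          # back to real values

def choose_frac_bits(x, bits):
    """Pick the largest fractional length whose integer part (incl. sign)
    still covers max|x|; an illustrative heuristic, not the paper's search."""
    max_abs = np.abs(x).max()
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))) + 1)
    return bits - int_bits
```

With 4-bit words and 3 fractional bits, for instance, values are rounded to the nearest multiple of 1/8 and saturated at ±7/8 on the positive side.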
3. Automated Toolflow and Hardware Generation
Given a trained CNN model, FPGA resource constraints (LUTs, DSPs, BRAM), an application-level error tolerance, and a small evaluation set, the CascadeCNN toolflow proceeds as follows:
- Quantisation Search: For each candidate bitwidth, sweep per-layer scaling, quantify accuracy degradation, and select the bitwidths $b_{LPU}$ and $b_{HPU}$ that satisfy the throughput–accuracy trade-off.
- Roofline-Guided Hardware Design-Space Exploration: For each bitwidth, enumerate feasible tile sizes for the matrix-matrix multiplication (MM) engine, and use a roofline model to maximize throughput under resource constraints. This tuning is performed independently for the LPU and HPU (Kouris et al., 2018).
- Confidence Evaluation Tuning: The CEU’s threshold parameters are optimized to minimize the fraction of samples routed to the HPU while satisfying the user’s error constraints.
- Bitstream Generation: After successful parameterization, the toolflow generates the combined bitstream for LPU, CEU, and HPU. Typical implementation relies on Vivado HLS for MM engine synthesis, and Vivado Design Suite for place-and-route; quantisation profiling often uses Python or Matlab.
The entire architecture is natively compatible with Caffe and TensorFlow models.
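The bitwidth-selection step of the toolflow can be sketched as a simple search loop, assuming a hypothetical `evaluate_acc(bits)` helper that quantises the model (with per-layer scaling already tuned) and scores it on the ~200-sample evaluation set:

```python
def quantisation_search(evaluate_acc, candidate_bits, baseline_acc, error_budget):
    """Pick the smallest wordlength meeting the error budget for the HPU,
    and a more aggressive one for the LPU. Sketch only: the real toolflow
    also verifies that the CEU can recover accuracy at the LPU bitwidth."""
    b_hpu = None
    for bits in sorted(candidate_bits):              # ascending: smallest first
        if baseline_acc - evaluate_acc(bits) <= error_budget:
            b_hpu = bits                             # smallest acceptable wordlength
            break
    if b_hpu is None:
        raise ValueError("no candidate bitwidth meets the error budget")
    lower = [b for b in candidate_bits if b < b_hpu]
    b_lpu = min(lower) if lower else b_hpu           # most aggressive candidate
    return b_lpu, b_hpu
```

Because accuracy is measured empirically rather than modeled analytically, the same loop works for any network the toolflow ingests.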
4. Run-Time Confidence Evaluation
The CEU operationalizes run-time classification uncertainty using the generalised Best-vs-Second-Best (gBvSB) margin:

$$\mathrm{gBvSB}(\mathbf{p}) = \sum_{i=1}^{k} p_{(i)} - \sum_{i=k+1}^{k+m} p_{(i)}$$

where $\mathbf{p}$ is the sorted softmax output vector ($p_{(1)} \geq p_{(2)} \geq \dots$) and $k$, $m$ are tunable parameters, with $k = m = 1$ recovering the classic BvSB margin $p_{(1)} - p_{(2)}$. A sample is deemed “confident” if its gBvSB value meets a tunable threshold; otherwise, it is forwarded to the HPU. All parameters of the CEU are tuned to ensure the composite cascade (LPU+CEU+HPU) does not exceed the designated application-level error tolerance. For classification tasks with well-behaved softmax outputs, this confidence gating incurs negligible computational overhead (essentially a comparator tree and look-up table), and delivers a principled trade-off between accuracy and throughput (Kouris et al., 2018).
5. Hardware Mapping and Matrix Multiply Core
Both LPU and HPU utilize a parameterized matrix-multiply (MM) core, onto which all convolution and fully-connected layers (lowered to GEMMs via im2col and batch-tiling) are mapped. Performance and area scale with the number of PEs and the degree of tiling:
- Each PE implements a cascade of multipliers and an adder-tree, with low bitwidths favoring MACC packing in LUTs and, for 4-bit operands, exploiting DSP overloading through zero-padding.
- The MM core is double-buffered to mask DRAM latency, and output is accumulated across looped tiling steps.
- For both 4- and 8-bit datapaths, resource allocation is tailored to maximize overlap with the device's resource envelope for each unit, and partial reconfiguration or bitstream switching can provide flexible reuse of the hardware block between the LPU and HPU (Kouris et al., 2018).
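The MM engine's loop structure can be mirrored by a reference tiled GEMM; this is a functional sketch of the tiling and accumulation pattern (tile names $T_m$, $T_n$, $T_k$ are illustrative), not the HLS implementation:

```python
import numpy as np

def tiled_matmul(A, B, Tm, Tn, Tk):
    """Tiled GEMM: each Tm x Tn output tile is accumulated across Tk-wide
    steps of the reduction dimension, matching the looped tiling the
    hardware performs while double-buffering tile loads from DRAM."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(0, M, Tm):              # output row tiles
        for j in range(0, N, Tn):          # output column tiles
            for k in range(0, K, Tk):      # accumulation across tiling steps
                C[i:i+Tm, j:j+Tn] += A[i:i+Tm, k:k+Tk] @ B[k:k+Tk, j:j+Tn]
    return C
```

Convolutions reach this kernel after im2col lowering, so one parameterized core serves every layer of both units.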
6. Throughput–Accuracy Trade-Off and Empirical Results
Let $a(b)$ denote top-k accuracy at bitwidth $b$, $t(b)$ the per-input latency, and $r$ the fraction of inputs re-processed by the HPU. End-to-end latency and system speedup over an HPU-only baseline are given by:

$$T = t(b_{LPU}) + r \cdot t(b_{HPU}), \qquad S = \frac{t(b_{HPU})}{t(b_{LPU}) + r \cdot t(b_{HPU})}$$
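The speedup relation can be checked numerically; the values below are hypothetical illustrations, not the paper's measurements:

```python
def cascade_speedup(t_lpu, t_hpu, r):
    """Speedup of the cascade over an HPU-only baseline:
    S = t_HPU / (t_LPU + r * t_HPU), where r is the HPU re-processing
    fraction. Speedup requires r < 1 - t_LPU / t_HPU."""
    return t_hpu / (t_lpu + r * t_hpu)

# Hypothetical example: an LPU twice as fast as the HPU with 30% of inputs
# re-processed gives 1 / (0.5 + 0.3) = 1.25x over the HPU alone.
```

The expression makes the CEU's role explicit: every percentage point shaved off $r$ translates directly into throughput, which is why the threshold tuning targets the minimal $r$ compatible with the error budget.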
Empirical evaluation using VGG-16 and AlexNet on Xilinx Zynq ZC706 and UltraScale+ ZCU102 platforms demonstrates the following:
- VGG-16 (Top-5 accuracy, ZC706, 200 MHz):
- Baseline 8-bit: 89.62% accuracy, 140 img/s
- CascadeCNN: LPU 4-bit (75.24% acc.), HPU 8-bit (89.62% acc.), 220 img/s, +55% speedup
- AlexNet (Top-5 accuracy):
- Baseline 8-bit: 79.13% accuracy, 380 img/s
- CascadeCNN: LPU 4-bit (64.48% acc.), HPU 8-bit (79.13% acc.), 563 img/s, +48% speedup
- Resource Example (VGG-16, ZC706):
| Unit | DSPs | LUTs (k) | BRAM |
|---|---|---|---|
| LPU (4b) | 240 | 45 | 280 |
| HPU (8b) | 290 | 60 | 320 |
| Budget | 300 | 100 | 400 |
Performance density figures show up to 3.3× higher GOp/s per DSP over single-precision accelerators for AlexNet and up to 2.7× for VGG-16 under comparable accuracy constraints (Kouris et al., 2018).
7. Implementation, Reproducibility, and Limitations
- LPU weights are derived by down-quantising the HPU weights at deployment, so no retraining or back-propagation is involved.
- The synthesis flow uses industry-standard tools (Vivado HLS for the MM engine, Vivado Design Suite for place-and-route). Reproducing a design requires a compatible CNN model, a small evaluation set, resource budgets, and an error tolerance; running the provided quantisation and design-space scripts then yields the full cascade configuration.
- CascadeCNN is most effective for classification tasks with “well-behaved” softmax outputs and requires an evaluation set consistent with application distribution for effective CEU tuning.
- Extremely low bitwidths (e.g., <4 bits) may cause excessive loss of dynamic range for some layers, impairing LPU accuracy recovery by the CEU.
The methodological innovation of the quantisation cascade enables principled exploitation of the time–accuracy trade-off in resource-constrained environments, establishing a new benchmark for automated CNN inference acceleration on FPGAs without the need for network retraining (Kouris et al., 2018).