CascadeCNN: FPGA Acceleration via Quantisation Cascades
- The paper introduces a two-stage design that leverages a low-precision unit for rapid inference and a high-precision unit for accuracy recovery without retraining.
- It details a dynamic fixed-point quantisation method with per-layer scaling to optimize bitwidths and balance speed with error constraints.
- Empirical results show up to 55% speedup on FPGA platforms while maintaining target CNN accuracy through confidence-guided selective computation.
Quantisation cascades, as realized in CascadeCNN, are a methodology for accelerating CNN inference on FPGAs through the synergistic combination of aggressive quantisation and confidence-driven selective re-computation. CascadeCNN implements a two-stage architecture comprising independently quantised units and an intermediate confidence evaluation, enabling high-throughput execution while preserving target accuracy levels under user-specified constraints. This approach is fully automated, requires no retraining or access to the original training data, and adapts both quantisation bitwidths and hardware parameters to the target FPGA and task requirements (Kouris et al., 2018).
1. Architectural Framework: Two-Stage Quantisation Cascade
CascadeCNN decomposes the inference pipeline into three logically distinct modules:
- Low-Precision Unit (LPU): Executes the entire CNN using a very low bitwidth (typically 4 or 5 bits, uniform across all layers but with per-layer scaling). This stage achieves maximal throughput by packing multiple MACCs into LUTs and exploiting DSP packing (e.g., two 4-bit MACCs per DSP using guard-bit interleaving). However, the accuracy is intentionally degraded to serve as a computational filter (Kouris et al., 2018).
- Confidence Evaluation Unit (CEU): After the LPU produces softmax outputs, the CEU computes a tunable uncertainty metric based on a generalised Best-vs-Second-Best (gBvSB) margin. Inputs deemed “confident” terminate after the LPU, while “hard” samples are forwarded to the high-precision stage.
- High-Precision Unit (HPU): Replicates the LPU topology but with a larger bitwidth (commonly 8 or 16 bits), restoring the accuracy of the original model for those samples that could not be confidently classified by the LPU.
The system dataflow is: inputs → LPU → CEU; outputs that pass the confidence threshold are accepted as final, the rest are recomputed from scratch by the HPU. Final predictions are merged in input order.
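The dataflow above can be sketched as a minimal sketch in NumPy, assuming hypothetical `lpu_forward`/`hpu_forward` stand-ins for the two hardware units and a generic `confidence` metric:

```python
import numpy as np

def cascade_infer(batch, lpu_forward, hpu_forward, confidence, threshold):
    """Two-stage cascade: cheap LPU pass for all inputs, HPU re-computation
    only for low-confidence samples. All callables are illustrative stand-ins
    for the hardware units described in the text."""
    lpu_probs = lpu_forward(batch)                  # (N, classes) softmax outputs
    confident = confidence(lpu_probs) >= threshold  # (N,) boolean mask from the CEU
    preds = lpu_probs.argmax(axis=1)                # LPU predictions, in input order
    hard = batch[~confident]
    if len(hard) > 0:
        hpu_probs = hpu_forward(hard)               # "hard" samples recomputed from scratch
        preds[~confident] = hpu_probs.argmax(axis=1)
    return preds, (~confident).mean()               # merged predictions + HPU fraction r
```

The returned fraction corresponds to the share of inputs routed to the HPU, which the CEU tuning step tries to minimize.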
2. Quantisation Methodology and Search
CascadeCNN employs a dynamic fixed-point quantisation format with uniform wordlength but per-layer scaling to accommodate the varying dynamic ranges of weights and activations. Formal quantisation of a real-valued tensor element $x$ in layer $l$ is given by:

$$\hat{x} = 2^{-s_l} \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(x \cdot 2^{s_l}\right),\; -2^{b-1},\; 2^{b-1}-1\right)$$

where $b$ is the bitwidth and each layer $l$ has its own scaling factors $s_l^{w}$ for weights and $s_l^{a}$ for activations. The quantisation search iteratively sweeps over candidate bitwidths and per-layer scaling, using a small evaluation set (≈200 samples) to empirically determine the accuracy surface. The smallest wordlength that gives acceptable accuracy (according to the user's error budget) is selected for the HPU; for the LPU, an even more aggressive quantisation is chosen, maximizing throughput while still allowing the CEU to recover overall accuracy post hoc. This optimization requires neither network retraining nor access to the original training set (Kouris et al., 2018).
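A minimal sketch of this dynamic fixed-point scheme, with a simple range-driven heuristic for the per-layer fractional length (the toolflow instead sweeps candidates against the evaluation set):

```python
import numpy as np

def dfp_quantise(x, bits, frac_bits):
    """Dynamic fixed-point: uniform wordlength `bits`, per-layer fractional
    length `frac_bits` acting as the scaling factor 2**frac_bits."""
    qmin, qmax = -2**(bits - 1), 2**(bits - 1) - 1
    q = np.clip(np.round(x * 2**frac_bits), qmin, qmax)   # integer codes
    return q * 2.0**(-frac_bits)                          # back to real values

def choose_frac_bits(x, bits):
    """Pick the largest fractional length whose integer part (incl. sign)
    still covers max|x|; an illustrative heuristic, not the paper's search."""
    max_abs = np.abs(x).max()
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))) + 1)
    return bits - int_bits
```

With 4-bit words and 3 fractional bits, for instance, values are rounded to the nearest multiple of 1/8 and saturated at ±7/8 on the positive side.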
3. Automated Toolflow and Hardware Generation
Given a trained CNN model, FPGA resource constraints (LUTs, DSPs, BRAM), an application-level error tolerance, and a small evaluation set, the CascadeCNN toolflow proceeds as follows:
- Quantisation Search: For each candidate bitwidth, sweep per-layer scaling, quantify accuracy degradation, and select the bitwidths $b_{LPU}$ and $b_{HPU}$ that satisfy the throughput–accuracy trade-off.
- Roofline-Guided Hardware Design-Space Exploration: For each bitwidth, enumerate feasible tile sizes for the matrix-matrix multiplication (MM) engine, and use a roofline model to maximize throughput under resource constraints. This tuning is performed independently for the LPU and HPU (Kouris et al., 2018).
- Confidence Evaluation Tuning: The CEU’s threshold parameters are optimized to minimize the fraction of samples routed to the HPU while satisfying the user’s error constraints.
- Bitstream Generation: After successful parameterization, the toolflow generates the combined bitstream for LPU, CEU, and HPU. Typical implementation relies on Vivado HLS for MM engine synthesis, and Vivado Design Suite for place-and-route; quantisation profiling often uses Python or Matlab.
The entire architecture is natively compatible with Caffe and TensorFlow models.
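The bitwidth-selection step of the toolflow can be sketched as a simple search loop, assuming a hypothetical `evaluate_acc(bits)` helper that quantises the model (with per-layer scaling already tuned) and scores it on the ~200-sample evaluation set:

```python
def quantisation_search(evaluate_acc, candidate_bits, baseline_acc, error_budget):
    """Pick the smallest wordlength meeting the error budget for the HPU,
    and a more aggressive one for the LPU. Sketch only: the real toolflow
    also verifies that the CEU can recover accuracy at the LPU bitwidth."""
    b_hpu = None
    for bits in sorted(candidate_bits):              # ascending: smallest first
        if baseline_acc - evaluate_acc(bits) <= error_budget:
            b_hpu = bits                             # smallest acceptable wordlength
            break
    if b_hpu is None:
        raise ValueError("no candidate bitwidth meets the error budget")
    lower = [b for b in candidate_bits if b < b_hpu]
    b_lpu = min(lower) if lower else b_hpu           # most aggressive candidate
    return b_lpu, b_hpu
```

Because accuracy is measured empirically rather than modeled analytically, the same loop works for any network the toolflow ingests.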
4. Run-Time Confidence Evaluation
The CEU operationalizes run-time classification uncertainty using the generalised Best-vs-Second-Best (gBvSB) margin:

$$\mathrm{gBvSB}(\mathbf{p}) = \sum_{i=1}^{k} p_{(i)} - \sum_{i=k+1}^{k+m} p_{(i)}$$

where $\mathbf{p}$ is the sorted softmax output vector ($p_{(1)} \geq p_{(2)} \geq \dots$) and $k$, $m$ are tunable parameters, with $k = m = 1$ recovering the classic BvSB margin $p_{(1)} - p_{(2)}$. A sample is deemed “confident” if its gBvSB value meets a tunable threshold; otherwise, it is forwarded to the HPU. All parameters of the CEU are tuned to ensure the composite cascade (LPU+CEU+HPU) does not exceed the designated application-level error tolerance. For classification tasks with well-behaved softmax outputs, this confidence gating incurs negligible computational overhead (essentially a comparator tree and look-up table), and delivers a principled trade-off between accuracy and throughput (Kouris et al., 2018).
5. Hardware Mapping and Matrix Multiply Core
Both LPU and HPU utilize a parameterized matrix-multiply (MM) core, onto which all convolution and fully-connected layers (lowered to GEMMs via im2col and batch-tiling) are mapped. Performance and area scale with the number of PEs and the degree of tiling:
- Each PE implements a cascade of multipliers and an adder-tree, with low bitwidths favoring MACC packing in LUTs and, for 4-bit operands, exploiting DSP overloading through zero-padding.
- The MM core is double-buffered to mask DRAM latency, and output is accumulated across looped tiling steps.
- For both 4- and 8-bit datapaths, resource allocation is tailored to maximize overlap with the device's resource envelope for each unit, and partial reconfiguration or bitstream switching can provide flexible reuse of the hardware block between the LPU and HPU (Kouris et al., 2018).
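The MM engine's loop structure can be mirrored by a reference tiled GEMM; this is a functional sketch of the tiling and accumulation pattern (tile names $T_m$, $T_n$, $T_k$ are illustrative), not the HLS implementation:

```python
import numpy as np

def tiled_matmul(A, B, Tm, Tn, Tk):
    """Tiled GEMM: each Tm x Tn output tile is accumulated across Tk-wide
    steps of the reduction dimension, matching the looped tiling the
    hardware performs while double-buffering tile loads from DRAM."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(0, M, Tm):              # output row tiles
        for j in range(0, N, Tn):          # output column tiles
            for k in range(0, K, Tk):      # accumulation across tiling steps
                C[i:i+Tm, j:j+Tn] += A[i:i+Tm, k:k+Tk] @ B[k:k+Tk, j:j+Tn]
    return C
```

Convolutions reach this kernel after im2col lowering, so one parameterized core serves every layer of both units.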
6. Throughput–Accuracy Trade-Off and Empirical Results
Let $a(b)$ denote top-k accuracy at bitwidth $b$, $t(b)$ the per-input latency, and $r$ the fraction of inputs re-processed by the HPU. End-to-end latency and system speedup over an HPU-only baseline are given by:

$$T = t(b_{LPU}) + r \cdot t(b_{HPU}), \qquad S = \frac{t(b_{HPU})}{t(b_{LPU}) + r \cdot t(b_{HPU})}$$
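The speedup relation can be checked numerically; the values below are hypothetical illustrations, not the paper's measurements:

```python
def cascade_speedup(t_lpu, t_hpu, r):
    """Speedup of the cascade over an HPU-only baseline:
    S = t_HPU / (t_LPU + r * t_HPU), where r is the HPU re-processing
    fraction. Speedup requires r < 1 - t_LPU / t_HPU."""
    return t_hpu / (t_lpu + r * t_hpu)

# Hypothetical example: an LPU twice as fast as the HPU with 30% of inputs
# re-processed gives 1 / (0.5 + 0.3) = 1.25x over the HPU alone.
```

The expression makes the CEU's role explicit: every percentage point shaved off $r$ translates directly into throughput, which is why the threshold tuning targets the minimal $r$ compatible with the error budget.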
Empirical evaluation using VGG-16 and AlexNet on Xilinx Zynq ZC706 and UltraScale+ ZCU102 platforms demonstrates the following:
- VGG-16 (Top-5 accuracy, ZC706, 200 MHz):
- Baseline 8-bit: 89.62% accuracy, 140 img/s
- CascadeCNN: LPU 4-bit (75.24% acc.), HPU 8-bit (89.62% acc.), 220 img/s, +55% speedup
- AlexNet (Top-5 accuracy):
- Baseline 8-bit: 79.13% accuracy, 380 img/s
- CascadeCNN: LPU 4-bit (64.48% acc.), HPU 8-bit (79.13% acc.), 563 img/s, +48% speedup
- Resource Example (VGG-16, ZC706):
| Unit | DSPs | LUTs (k) | BRAM |
|---|---|---|---|
| LPU (4b) | 240 | 45 | 280 |
| HPU (8b) | 290 | 60 | 320 |
| Budget | 300 | 100 | 400 |
Performance density figures show up to 3.3× higher GOp/s per DSP over single-precision accelerators for AlexNet and up to 2.7× for VGG-16 under comparable accuracy constraints (Kouris et al., 2018).
7. Implementation, Reproducibility, and Limitations
- LPU weights are derived by down-quantising the HPU weights at deployment, so no retraining or back-propagation is involved.
- The synthesis flow uses industry-standard tools (Vivado HLS for the MM engine, Vivado Design Suite for place-and-route). Reproducing a design requires a compatible CNN model, a small evaluation set, resource budgets, and an error tolerance; running the provided quantisation and design-space scripts then yields the full cascade configuration.
- CascadeCNN is most effective for classification tasks with “well-behaved” softmax outputs and requires an evaluation set consistent with application distribution for effective CEU tuning.
- Extremely low bitwidths (e.g., <4 bits) may cause excessive loss of dynamic range for some layers, impairing LPU accuracy recovery by the CEU.
The methodological innovation of the quantisation cascade enables principled exploitation of the time–accuracy trade-off in resource-constrained environments, establishing a new benchmark for automated CNN inference acceleration on FPGAs without the need for network retraining (Kouris et al., 2018).