
FQ-PETR: INT8 Quantization for 3D Detectors

Updated 14 November 2025
  • The paper introduces FQ-PETR, an integer-only quantization framework that maintains near-floating-point accuracy (<1% drop) using techniques like QFPE, DULUT, and QANS.
  • It addresses key challenges by aligning the dynamic ranges of multimodal inputs and efficiently quantizing nonlinear operators such as Softmax, exp, and SiLU.
  • Empirical results demonstrate up to 4× latency improvement and 75% memory reduction on PETR-style models without requiring specialized hardware.

FQ-PETR (“Fully Quantized Position Embedding Transformation”) is an end-to-end integer-only quantization framework designed for transformer-based multi-view 3D object detectors, notably PETR, PETRv2, StreamPETR, and MV2D. It achieves near-floating-point accuracy (within a 1% mAP/NDS drop) at INT8 precision (W8A8: 8-bit weights and activations) while reducing inference latency by up to 4× and memory consumption by 75%. FQ-PETR directly addresses two fundamental pitfalls in quantizing PETR-style architectures: the substantial magnitude disparity between multimodal inputs (image features and positional embeddings), and the difficulty of efficiently quantizing non-linear operators such as Softmax, exp, and SiLU, all without specialized hardware or major architectural redesign.

1. Underlying Architecture and Motivation

PETR and its derivatives implement a two-stage multi-view 3D detection pipeline: (1) a 2D image backbone (e.g., ResNet-DCN) extracts per-pixel features from multiple camera views, and (2) a transformer decoder fuses these features with “camera-ray” position embeddings (PEs) to predict oriented 3D bounding boxes in bird’s-eye view (BEV). For each image pixel $(u, v)$ and sampled depths, PETR computes normalized 3D coordinates, applies an inverse-sigmoid nonlinearity, and projects through an MLP to obtain PEs (dimension $D$), which are fused with the same-dimensional image features.

Standard quantization techniques, when naively applied to this pipeline, incur catastrophic accuracy degradation (e.g., a drop of more than 20 mAP points). The origin is twofold: (i) modality scale mismatch: PEs range as high as $\pm 130$ versus image features at $\pm 4$, so under a shared quantization scale the small-valued features collapse into a handful of quantized bins; (ii) quantization of non-linear functions introduces large errors, especially for Softmax and exp in attention modules, and specialized hardware for these is unavailable or impractical for deployment.
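A quick numerical illustration of point (i), using hypothetical uniformly distributed values rather than real PETR activations: when both modalities share one symmetric per-tensor INT8 scale, the image features occupy only a handful of the 256 available codes.

```python
# Minimal sketch (hypothetical value ranges) of the modality scale mismatch.
import numpy as np

rng = np.random.default_rng(0)
img_feat = rng.uniform(-4.0, 4.0, size=10_000)      # image-feature range
pos_emb = rng.uniform(-130.0, 130.0, size=10_000)   # PE range before QFPE

# Symmetric per-tensor INT8 scale is driven by the joint maximum (the PEs).
scale = max(np.abs(img_feat).max(), np.abs(pos_emb).max()) / 127.0

feat_codes = np.clip(np.round(img_feat / scale), -128, 127)
print("distinct INT8 codes used by image features:",
      np.unique(feat_codes).size)   # only ~9 of 256 codes survive
```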

2. Quantization-Friendly LiDAR-Ray Position Embedding (QFPE)

2.1 Canonical and QFPE Methodology

Traditional PETR-style PEs are generated by:

  • Sampling $N$ depths per pixel along the camera ray.
  • Mapping $(x, y, z)$ to $[0,1]^3$, then computing the inverse-sigmoid:

$$\hat{\mathbf{v}} = \ln\!\left( \frac{\mathbf{v} + \epsilon}{1 - (\mathbf{v} + \epsilon)} \right), \quad \epsilon = 10^{-5}$$

  • Passing $\hat{\mathbf{v}}$ through a 2-layer MLP:

$$\mathrm{PE_{CR}} = W_2 \, \mathrm{ReLU}(W_1 \hat{\mathbf{v}} + b_1) + b_2$$

  • This process amplifies vector magnitudes by $\sim 11.5\times$, yielding components in the $\pm 130$ range (illustrated by the sketch below).
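A minimal NumPy sketch of this canonical path, with hypothetical dimensions and random weights (not the paper's trained MLP), showing how the inverse-sigmoid stretches normalized coordinates well beyond their input range:

```python
# Sketch of the canonical camera-ray PE path: normalized 3D coords ->
# inverse-sigmoid -> 2-layer MLP. N and D are assumed, not from the paper.
import numpy as np

def inverse_sigmoid(v, eps=1e-5):
    # ln((v + eps) / (1 - (v + eps))) as in the text, clamped for safety
    v = np.clip(v + eps, eps, 1.0 - eps)
    return np.log(v / (1.0 - v))

rng = np.random.default_rng(0)
N, D = 64, 256                                 # depth samples, embed dim
coords = rng.uniform(0.0, 1.0, size=(N, 3))    # normalized (x, y, z) per depth
v_hat = inverse_sigmoid(coords)                # near 0/1 maps to ~±ln(1/eps) ≈ ±11.5

W1 = rng.standard_normal((3 * N, D)) * 0.05
W2 = rng.standard_normal((D, D)) * 0.05
pe = np.maximum(v_hat.reshape(-1) @ W1, 0.0) @ W2   # biases omitted for brevity
print("max |v_hat|:", np.abs(v_hat).max(), " max |PE|:", np.abs(pe).max())
```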

QFPE replaces this with:

  1. Single-point sampling at a LiDAR-prior depth (e.g., 30 m), eliminating non-linear interpolation over $N$ depths and removing all logit transformations.
  2. Anchor-based embedding: for each axis $\alpha \in \{x, y, z\}$, discrete anchor locations $L_\alpha^1 < L_\alpha^2 < L_\alpha^3$ and trainable embeddings $E_\alpha^1, E_\alpha^2, E_\alpha^3$ are defined. For a sampled coordinate $p_\alpha \in [L_\alpha^i, L_\alpha^{i+1}]$:

$$e_\alpha = \frac{p_\alpha - L_\alpha^i}{L_\alpha^{i+1} - L_\alpha^i} E_\alpha^{i+1} + \frac{L_\alpha^{i+1} - p_\alpha}{L_\alpha^{i+1} - L_\alpha^i} E_\alpha^i$$

The resulting $e_x$, $e_y$, $e_z$ are concatenated and passed through the MLP.
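A sketch of this interpolation for a single axis; the anchor positions, embedding dimension, and bound $\gamma = 0.8$ are assumptions for illustration. Because the output is a convex combination of two embeddings, its infinity norm cannot exceed the larger of theirs:

```python
# QFPE anchor-based embedding for one axis (hypothetical anchors and dim).
import numpy as np

def qfpe_axis_embedding(p, anchors, embeddings):
    """Piecewise-linear interpolation between the two bracketing anchors."""
    i = np.searchsorted(anchors, p) - 1        # index with anchors[i] <= p
    i = np.clip(i, 0, len(anchors) - 2)
    lo, hi = anchors[i], anchors[i + 1]
    w = (p - lo) / (hi - lo)                   # blend weight in [0, 1]
    return (1.0 - w) * embeddings[i] + w * embeddings[i + 1]

rng = np.random.default_rng(0)
anchors = np.array([-51.2, 0.0, 51.2])         # L^1 < L^2 < L^3 (assumed)
gamma = 0.8
E = rng.uniform(-gamma, gamma, size=(3, 85))   # trainable, regularized embeddings
e_x = qfpe_axis_embedding(12.5, anchors, E)
print("||e_x||_inf =", np.abs(e_x).max())      # <= gamma by convexity
```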

2.2 Scale Alignment

With regularization ($\gamma \approx 0.8$), the paper's boundedness result gives $\|e_\alpha\|_\infty \leq \gamma$, and after the MLP the QFPE embedding falls in the range $\pm 29.7$ (versus $\pm 127.3$ originally), bringing it close to the image feature range ($\pm 4$). This $4.4\times$ reduction in dynamic range enables both modalities to be quantized to 8-bit integers under a single tensor scale without destructive loss of fidelity.

3. Dual-Lookup Table (DULUT) for Nonlinear Operators

3.1 Motivation

Efficient quantization of nonlinearities such as exp, SiLU, and GELU is critical, especially in transformer cross-attention. A direct uniform LUT with one entry per INT8 input code (256 entries per operator) is costly in hardware, while non-uniform or neural-network-based LUTs impose additional circuit complexity.

3.2 Cascaded Lookup Table Design

DULUT structures nonlinear approximations as two cascaded LUTs:

$$i_q \xrightarrow{\mathrm{LUT}_1} \mathrm{idx} \xrightarrow{\mathrm{LUT}_2} o_q$$

  • LUT$_1$ (size $m_1$): maps the quantized input code to an intermediate index, non-uniformly concentrating resolution in regions of high curvature.
  • LUT$_2$ (size $m_2$): holds the corresponding linearly interpolated outputs.

The error on an interval $[x_i, x_{i+1}]$ is bounded by:

$$\max_{x \in [x_i, x_{i+1}]} |f(x) - P(x)| \leq \frac{(x_{i+1} - x_i)^2}{8} \max_{x \in [x_i, x_{i+1}]} |f''(x)| + \varepsilon_\mathrm{hw}$$

where $\varepsilon_\mathrm{hw}$ accounts for blending discretization.
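A hedged sketch of the cascaded lookup at inference time, assuming a simplified layout in which LUT$_1$ has one entry per INT8 code and LUT$_2$ stores segment-endpoint outputs; uniform segments are used here for brevity, whereas the paper's placement is non-uniform:

```python
# DULUT-style evaluation of SiLU: two table lookups plus one linear blend,
# no transcendental function at runtime. Table layout is an assumption.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

in_scale, m2 = 8.0 / 127.0, 32                 # input scale and LUT2 segments
codes = np.arange(-128, 128)
x = codes * in_scale                           # dequantized input grid
edges = np.linspace(x.min(), x.max(), m2 + 1)  # segment boundaries
lut1 = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m2 - 1)
lut2 = silu(edges)                             # endpoint outputs per segment

def dulut_silu(code):
    """Approximate SiLU for one INT8 code via LUT1 -> LUT2 -> blend."""
    seg = lut1[code + 128]
    lo, hi = edges[seg], edges[seg + 1]
    w = (code * in_scale - lo) / (hi - lo)
    return (1.0 - w) * lut2[seg] + w * lut2[seg + 1]

err = max(abs(dulut_silu(c) - silu(c * in_scale)) for c in codes)
print("max abs error over all 256 codes:", err)
```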

3.3 LUT Optimization

An iterative average relative error (ARE)-driven algorithm splits high-ARE intervals and merges low-ARE ones until the global ARE falls below a tolerance $\delta$. Empirically, DULUT with (32, 32) entries for SiLU on INT8 matches the accuracy of a single 256-entry LUT, but with far less hardware overhead (standard LUTs and linear interpolation only). A simplified refinement loop is sketched below.
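A split-only sketch of such a refinement loop; the merge step and the paper's exact ARE definition are omitted, and `delta`, the starting grid, and the segment budget are assumptions:

```python
# Greedy ARE-driven segment refinement for a piecewise-linear SiLU LUT.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def segment_are(f, lo, hi, n=64):
    """Average relative error of a linear fit to f over [lo, hi]."""
    xs = np.linspace(lo, hi, n)
    approx = f(lo) + (f(hi) - f(lo)) * (xs - lo) / (hi - lo)
    return float(np.mean(np.abs(approx - f(xs)) / (np.abs(f(xs)) + 1e-6)))

delta, max_segments = 1e-2, 32
edges = list(np.linspace(-8.0, 8.0, 9))        # start from 8 uniform segments
while len(edges) - 1 < max_segments:
    ares = [segment_are(silu, a, b) for a, b in zip(edges, edges[1:])]
    worst = int(np.argmax(ares))
    if ares[worst] < delta:                    # global ARE within tolerance
        break
    mid = 0.5 * (edges[worst] + edges[worst + 1])
    edges.insert(worst + 1, mid)               # split the worst interval

print(len(edges) - 1, "segments, worst ARE:",
      max(segment_are(silu, a, b) for a, b in zip(edges, edges[1:])))
```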

3.4 Deployment

DULUT’s Triton-style kernel requires two linear LUT lookups plus a single 8-bit interpolation, fully compatible with commodity compilers and integer hardware, with no custom comparator logic.

4. Quantization After Numerical Stabilization (QANS)

4.1 Softmax Distortion in Quantization

The input logits to cross-attention Softmax can span large positive and negative values. Direct quantization in INT8 introduces substantial distortion in the subsequent exp evaluation, shifting attention peaks and resulting in massive accuracy loss.

4.2 QANS Procedure

The QANS pipeline applies numerical stabilization before quantization by subtracting the maximum logit from each vector element:

$$x_s = x - \max_j x_j, \quad x_s \leq 0$$

Discrete candidate scales $s_i$ ($s_i = i/2^{k-1}$, $k = 8$, $N \approx 20$) are used for quantize-dequantize passes:

$$\hat{x}_s^i = s_i \cdot \mathrm{clamp}\!\left(\mathrm{round}(x_s / s_i),\, -2^{k-1},\, 2^{k-1}-1\right)$$

The floating-point and candidate quantized softmax outputs ($p_f$, $p_q^i$) are computed, and the scale $s_{\hat\imath}$ minimizing the $L_1$ difference is retained for future inference:

$$\hat\imath = \arg\min_i \|p_f - p_q^i\|_1$$

This scale selection ensures logit distributions are tightly bounded, attention peaks are preserved, and near-lossless quantized attention is achieved.
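A compact sketch of this calibration search; the logit statistics, row count, and use of exactly $N = 20$ candidate scales are assumptions:

```python
# QANS-style scale search: stabilize logits, then pick the candidate scale
# whose quantize-dequantize softmax is closest in L1 to the FP softmax.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def qdq(x, s, k=8):
    """Symmetric k-bit quantize-dequantize with scale s."""
    return s * np.clip(np.round(x / s), -2**(k - 1), 2**(k - 1) - 1)

rng = np.random.default_rng(0)
logits = rng.normal(0.0, 6.0, size=(100, 64))        # attention logit rows
x_s = logits - logits.max(axis=-1, keepdims=True)    # stabilization: x_s <= 0

p_f = softmax(x_s)                                   # floating-point reference
candidates = [i / 2**7 for i in range(1, 21)]        # s_i = i / 2^(k-1), N = 20
errors = [np.abs(p_f - softmax(qdq(x_s, s))).sum() for s in candidates]
s_hat = candidates[int(np.argmin(errors))]
print("selected scale:", s_hat)
```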

5. Implementation Strategies and Empirical Results

5.1 Quantization Strategy

Weights and activations use symmetric 8-bit integer quantization (W8A8). Post-training quantization (PTQ) leverages 32 calibration images with per-tensor scales. No retraining is necessary except for QFPE, which is integrated into a brief FP32 fine-tuning phase.
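A minimal sketch of the per-tensor symmetric W8A8 setup described above; the helper names and tensor shapes are illustrative, not from the paper's code:

```python
# Per-tensor symmetric INT8 PTQ: calibrate a scale from 32 images, then
# quantize with round-and-clamp as in the formulas above.
import numpy as np

def calibrate_scale(tensors, k=8):
    """Per-tensor symmetric scale from a small calibration set."""
    amax = max(float(np.abs(t).max()) for t in tensors)
    return amax / (2**(k - 1) - 1)

def quantize(x, scale, k=8):
    q = np.clip(np.round(x / scale), -2**(k - 1), 2**(k - 1) - 1)
    return q.astype(np.int8)

rng = np.random.default_rng(0)
calib = [rng.normal(size=(256, 256)) for _ in range(32)]   # 32 calibration images
scale = calibrate_scale(calib)
x_q = quantize(calib[0], scale)
x_dq = x_q.astype(np.float32) * scale                      # dequantized view
print("scale:", scale,
      "max reconstruction error:", float(np.abs(x_dq - calib[0]).max()))
```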

Deployments use standard ONNX and Triton kernels, tested on NVIDIA RTX 4090 hardware. DULUT and QANS run via standard integer arithmetic and LUT primitives, requiring no custom accelerators or comparator trees.

5.2 Performance Benchmarks

| Model | Precision | mAP / NDS | Latency (FPS) | Memory (GB) | Relative Latency |
| --- | --- | --- | --- | --- | --- |
| PETR | FP32 | 31.42 / 36.11 | 7.1 | 4.8 | 1.0× |
| PETR (SQ PTQ) | INT8 | 20.67 / 29.32 | | | |
| PETR (QuaRot) | INT8 | 22.81 / 30.00 | | | |
| FQ-PETR | INT8 (W8A8 + QANS) | 31.46 / 37.19 | 27.6 | 1.3 | 3.9× |
| StreamPETR | FP32 | 49.51 / 58.03 | | | |
| FQ-StreamPETR | INT8 (W8A8 + QANS) | 50.48 / 58.61 | | | |

Vanilla PTQ baselines (SQ, QuaRot) exhibit an 8–10 point mAP/NDS drop, while FQ-PETR's losses stay consistently within 1%. Memory reductions of up to 75% and speedups of nearly 4× are attained.

6. Analysis, Limitations, and Prospective Work

FQ-PETR’s core contributions—QFPE, DULUT, and QANS—collectively resolve both critical challenges to quantizing transformer-based multi-view 3D object detection:

  • Magnitude alignment by QFPE prevents feature collapse during quantization.
  • Efficient nonlinear approximations via DULUT avoid the need for exponential LUT memory or hardware comparators.
  • Robust quantized attention through QANS ensures softmax behavior is preserved, even under aggressive 8-bit quantization.

Limitations include the current focus on INT8 precision; mixed-precision or lower-bitwidth quantization is not yet addressed. QFPE requires a one-off FP32 fine-tuning stage. Validation so far covers PETR-style multi-camera models on the nuScenes benchmark; generalization to other sensor configurations, datasets, and larger models remains open. Potential future work includes integration of quantization-aware training (QAT), federated learning for privacy, and adaptation to large-scale vision-transformer or LLM-style attention modules.

FQ-PETR establishes a scalable, integer-only, hardware-friendly quantization workflow for high-performance 3D perception models, providing an operational pathway for deployment in resource-constrained autonomous driving scenarios (Yu et al., 12 Nov 2025).
