Rotated Damped Fourier Surrogate (RDFS)

Updated 2 February 2026
  • RDFS is a differentiable surrogate operator derived from discrete Fourier analysis that replaces the non-differentiable rounding function in quantization-aware training.
  • It rotates the rounding function by 45° and applies amplitude damping to yield smooth and bounded gradients for robust ultra-low bitwidth optimization.
  • The method guarantees L₂-optimality and stable gradient variance, outperforming standard STE and DSQ techniques in challenging quantization regimes.

The Rotated Damped Fourier Surrogate (RDFS) is a differentiable surrogate operator designed for quantization-aware training (QAT) at ultra-low bitwidths. Introduced in the StableQAT framework, RDFS enables stable and robust optimization by providing smooth, bounded, and computationally efficient gradients for the non-differentiable rounding operation that underpins quantization. Unlike the widely used straight-through estimator (STE) or soft quantizer gradients, RDFS is derived from a discrete Fourier analysis of the rounding operator after a canonical (45°) rotation and subsequent amplitude damping, yielding a plug-and-play surrogate with provable optimality and stability guarantees (Chen et al., 27 Jan 2026).

1. Formal Definition and Forward/Backward Workflow

RDFS replaces the gradient of the rounding operation with a theoretically motivated Fourier-based function. In the forward pass, quantization proceeds exactly as in standard QAT:

  • x_q = \text{clip}(\text{round}(x),\, q_{\min},\, q_{\max})

For the backward pass, the surrogate gradient replaces the non-differentiable step response:

  • \frac{\partial x_q}{\partial x} \approx g(x, x_q), where

g(x, x_q) = \frac{1 - A\sqrt{2}\,\pi \sum_{m=0}^{M} \frac{(-1)^m}{2m+1} \cos\!\big((2m+1)\pi(x + x_q)\big)}{1 + A\sqrt{2}\,\pi \sum_{m=0}^{M} \frac{(-1)^m}{2m+1} \cos\!\big((2m+1)\pi(x + x_q)\big)}

with M the Fourier order (typically 0), A the amplitude (A ≲ 0.225), and the rotation angle fixed at 45°. The default configuration is M = 0, A ≈ 0.21.
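
As a concrete sketch, the forward rule and this surrogate gradient fit in a few lines of plain Python. The function names `rdfs_grad` and `quantize`, and the clip range, are our own illustrative choices, not from the paper:

```python
import math

def rdfs_grad(x, x_q, A=0.21, M=0):
    """RDFS surrogate gradient g(x, x_q) for d(x_q)/dx, truncated at Fourier order M."""
    s = sum((-1) ** m / (2 * m + 1) * math.cos((2 * m + 1) * math.pi * (x + x_q))
            for m in range(M + 1))
    fp = A * math.sqrt(2) * math.pi * s   # truncated A*sqrt(2)*pi * sum(...)
    return (1 - fp) / (1 + fp)

def quantize(x, q_min=-8, q_max=7):
    """Forward pass: standard round-then-clip quantization (integer grid, scale 1)."""
    return min(max(round(x), q_min), q_max)

x = 0.37
g = rdfs_grad(x, quantize(x))   # smooth, bounded stand-in for the step gradient
```

Note that setting A = 0 makes the surrogate collapse to the constant 1, i.e. the straight-through estimator.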

2. Mathematical Derivation and Transformation Rationale

The formulation of RDFS follows a specific analytic pathway:

  • The rounding staircase function is rotated by 45°, transforming the piecewise-constant staircase into a centered triangle wave.
  • In the rotated basis, with coordinates t = (x + x_q)/\sqrt{2} and f(t) = (-x + x_q)/\sqrt{2}, the staircase is expressed as a periodic triangle wave:

f(t) = \frac{1}{2\sqrt{2}}\left(1 - 4\left|r(t) - \tfrac{1}{2}\right|\right)

with period T = \sqrt{2} and r(t) the fractional-part operator.

  • The triangle wave f(t) admits a Fourier expansion:

f(t) \approx -A \sum_{m=0}^{\infty} \frac{(-1)^m}{(2m+1)^2} \sin\!\big((2m+1)\sqrt{2}\,\pi t\big)

  • Differentiating termwise leads to:

f'(t) = -A\sqrt{2}\,\pi \sum_{m=0}^{\infty} \frac{(-1)^m}{2m+1} \cos\!\big((2m+1)\sqrt{2}\,\pi t\big)

  • Inverse rotation and the chain rule yield:

\frac{\partial x_q}{\partial x} = \frac{1 + f'(t)}{1 - f'(t)}

Substituting t = (x + x_q)/\sqrt{2} and truncating the Fourier expansion at m = M recovers the RDFS formula.
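
The termwise differentiation step can be sanity-checked numerically: a central finite difference of the truncated sine series should agree with the cosine series for f'(t). A minimal self-check of ours (the truncation order M = 5 and evaluation point t = 0.173 are arbitrary):

```python
import math

A, M = 0.21, 5   # a few extra Fourier terms, just for the check

def f_series(t):
    """Truncated sine series for the triangle wave f(t)."""
    return -A * sum((-1) ** m / (2 * m + 1) ** 2
                    * math.sin((2 * m + 1) * math.sqrt(2) * math.pi * t)
                    for m in range(M + 1))

def fprime_series(t):
    """Truncated cosine series for f'(t), obtained by termwise differentiation."""
    return -A * math.sqrt(2) * math.pi * sum(
        (-1) ** m / (2 * m + 1)
        * math.cos((2 * m + 1) * math.sqrt(2) * math.pi * t)
        for m in range(M + 1))

t, h = 0.173, 1e-6
numeric = (f_series(t + h) - f_series(t - h)) / (2 * h)   # central difference
assert abs(numeric - fprime_series(t)) < 1e-6
```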

3. Properties and Practical Implementation

Surrogate Activation and Its Bounds

With M = 0, the RDFS gradient activation simplifies to:

g(x, x_q) = \frac{1 - A\sqrt{2}\,\pi\cos\!\big(\pi(x + x_q)\big)}{1 + A\sqrt{2}\,\pi\cos\!\big(\pi(x + x_q)\big)}

  • For c = A\sqrt{2}\,\pi < 1,

g(x, x_q) \in \left[\frac{1-c}{1+c},\ \frac{1+c}{1-c}\right]

  • The function is C^\infty in x (except at the quantization clipping boundaries), providing smooth gradients throughout the valid range.
  • This form ensures gradients are both bounded and non-exploding.
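
These bounds are easy to verify numerically for the default A = 0.21; the following small self-check is ours, not from the paper:

```python
import math

A = 0.21
c = A * math.sqrt(2) * math.pi            # c = A*sqrt(2)*pi, must stay below 1
lo, hi = (1 - c) / (1 + c), (1 + c) / (1 - c)

def g(x, x_q):
    """First-order (M = 0) RDFS surrogate gradient."""
    return (1 - c * math.cos(math.pi * (x + x_q))) / \
           (1 + c * math.cos(math.pi * (x + x_q)))

# Sample g over a grid of inputs paired with their rounded values.
vals = [g(x / 100.0, round(x / 100.0)) for x in range(-300, 301)]
assert lo - 1e-9 <= min(vals) and max(vals) <= hi + 1e-9
```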

Algorithmic Loop and Pseudocode

A typical QAT iteration using RDFS proceeds as follows:

  • Hyperparameters: amplitude A < 1/(\sqrt{2}\pi), Fourier order M (usually 0), quantization bitwidth b
  • Forward: x_q = \text{clip}(\text{round}(x/s),\, q_{\min},\, q_{\max}) \cdot s, for scale s
  • Backward: for each pair (x, x_q),
    • Compute g(x, x_q) as above
    • \frac{\partial L}{\partial x} = \frac{\partial L}{\partial x_q} \cdot g(x, x_q)
  • Update: Standard optimizer step
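
The loop above can be sketched end-to-end on a toy one-parameter problem. This is our own minimal example: scale s = 1, clip range [-2, 2], plain gradient descent, and the loss (x_q - 1)^2 are all illustrative choices, not from the paper:

```python
import math

A = 0.21
q_min, q_max = -2, 2

def forward(x):
    """Forward pass with scale s = 1: round, then clip."""
    return float(min(max(round(x), q_min), q_max))

def rdfs_g(x, x_q):
    """First-order (M = 0) RDFS surrogate gradient."""
    c = A * math.sqrt(2) * math.pi
    return (1 - c * math.cos(math.pi * (x + x_q))) / \
           (1 + c * math.cos(math.pi * (x + x_q)))

x, lr, target = 0.2, 0.05, 1.0
first_loss = (forward(x) - target) ** 2
for _ in range(200):
    x_q = forward(x)
    dL_dxq = 2.0 * (x_q - target)         # gradient of L = (x_q - target)^2
    x -= lr * dL_dxq * rdfs_g(x, x_q)     # chain rule through the surrogate
final_loss = (forward(x) - target) ** 2
```

With a pure STE (gradient identically 1) the update direction would be the same here; the surrogate changes the step magnitude, shrinking it near grid points and growing it near rounding boundaries.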

4. Theoretical Guarantees and Comparative Analysis

RDFS is characterized by several formal properties:

| Property | RDFS | STE/DSQ |
|---|---|---|
| L₂-optimality | Unique L₂-minimizer among trigonometric surrogates of degree ≤ M (Thm 5.1) | STE is a constant, suboptimal case |
| Gradient variance as A → 1/(√2π) | Bounded: 16/(3π) − 16/π² ≈ 0.076 (Thm 5.2) | DSQ variance explodes as sharpness α → 0 |
| Gradient boundedness | Uniformly bounded in [(1−c)/(1+c), (1+c)/(1−c)] | STE is constant 1; DSQ unbounded for sharp α |
| Degeneracy | Reduces to STE as A → 0 | STE recovers only the constant case |

The L₂-optimality of RDFS ensures minimal surrogate mismatch; the bounded-variance property leads to stable training dynamics without the sharp gradient spikes encountered in soft quantizers like DSQ in high-sharpness regimes (Chen et al., 27 Jan 2026).

5. Hyperparameter Selection and Empirical Recommendations

Empirical guidelines for applying RDFS in low-bitwidth QAT scenarios are as follows:

  • Prefer the first-order setting (M = 0); it delivers nearly all of the benefit at near-zero computational cost.
  • Select amplitude A in [0.15, 0.25]; a canonical setting is A = 0.21, which remains safely below the ill-conditioned regime at A = 1/(√2π) ≈ 0.225.
  • The rotation angle is fixed at 45°; no tuning is required.
  • Limited amplitude sweeps within the recommended range can be beneficial, but the vicinity of A ≈ 1/(√2π) should be avoided due to instability.
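
The ill-conditioning near A = 1/(√2π) can be made concrete by tabulating the upper gradient bound (1 + c)/(1 − c) for a few amplitudes. A small illustration of ours, with the specific sweep values chosen arbitrarily within and near the recommended range:

```python
import math

def upper_bound(A):
    """Upper gradient bound (1 + c)/(1 - c) of the M = 0 surrogate, c = A*sqrt(2)*pi."""
    c = A * math.sqrt(2) * math.pi
    return (1 + c) / (1 - c)

for A in (0.15, 0.21, 0.224):
    print(f"A = {A:5.3f}  ->  max surrogate gradient = {upper_bound(A):9.2f}")
```

The bound grows slowly through most of [0.15, 0.25] but diverges as c approaches 1, which is why amplitudes right at 1/(√2π) destabilize training.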

6. Significance in Quantization-Aware Training

RDFS delivers a strictly more expressive and robust alternative to STE and soft quantizers for QAT at 2–4 bits. The bounded, smooth, and nearly cost-free gradient construction enables plug-and-play integration into existing QAT pipelines with theoretically guaranteed optimization properties. The framework generalizes STE and uniquely achieves L₂-optimality (for a given order M) among trigonometric surrogates, with proven uniform gradient bounds and stable variance, even under extreme surrogate sharpness (Chen et al., 27 Jan 2026).

A plausible implication is that for ultra-low bitwidth regimes, where gradient mismatch and instability are pronounced, RDFS enables consistent convergence and improved quantization robustness with negligible additional overhead. This is particularly consequential for deploying large-scale models under aggressive memory and latency constraints.
