
Differentiable Weightless Controllers (DWCs)

Updated 8 December 2025
  • Differentiable Weightless Controllers (DWCs) are policy architectures that replace neural network multiply–accumulate layers with sparse Boolean lookup tables for interpretable and efficient continuous control.
  • They employ surrogate gradients and straight-through estimators to enable end-to-end gradient-based reinforcement learning, achieving competitive results on MuJoCo benchmarks.
  • Their FPGA-compatible design delivers ultra-low latency and nanojoule-level energy consumption, making them ideal for safety- and resource-critical applications.

Differentiable Weightless Controllers (DWCs) are a symbolic-differentiable policy architecture for continuous control that replaces the conventional multiply–accumulate layers of neural-network policies with sparse, discrete logic circuits. A DWC directly maps real-valued observations to actions using a cascade of data-adaptive thermometer encodings, sparsely connected Boolean lookup-table (LUT) layers, and lightweight continuous-action heads. While fully discrete at inference, DWCs are amenable to end-to-end, gradient-based reinforcement learning due to the adoption of surrogate gradients and straight-through estimators during training. DWCs can be synthesized into FPGA-compatible architectures with single- or few-cycle latency and operate at nanojoule-level energy cost per action. Their sparse, structured connectivity yields directly interpretable input-output relationships in policy networks, making them suited to safety- and resource-critical control applications (Kresse et al., 1 Dec 2025).

1. Architecture of Differentiable Weightless Controllers

A Differentiable Weightless Controller comprises three principal components:

  1. Thermometer Encoding: Each real-valued observation $x_j \in \mathbb{R}$ is first normalized using maintained running statistics $(\mu_j, \sigma_j)$ and clipped: $\hat{x}_j = \mathrm{clip}\!\left(\frac{x_j-\mu_j}{\sigma_j}, -10, 10\right)$. An odd number $B$ of thresholds is placed at stretched-Gaussian quantiles, such that for each $m=1,\ldots,B$:

q_m = \frac{m}{B+1},\quad s = \frac{10}{\left|\Phi^{-1}\!\left(\frac{1}{B+1}\right)\right|},\quad \tau_{j,m} = s\,\Phi^{-1}(q_m)

with $\tau_{j,1} = -10$, $\tau_{j,(B+1)/2} = 0$, and $\tau_{j,B} = 10$. The thermometer code is:

E_j(\hat x_j) = \big[\mathbf{1}\{\hat x_j\geq\tau_{j,1}\},\ \ldots,\ \mathbf{1}\{\hat x_j\geq\tau_{j,B}\}\big] \in \{0,1\}^B

The concatenation over all $d_\text{in}$ input dimensions produces $b^{(0)}\in\{0,1\}^{d_\text{in}B}$.

  2. Boolean LUT Layers: Each of the $L$ hidden layers contains $D_\ell$ binary outputs $b^{(\ell)} \in \{0,1\}^{D_\ell}$. Every output bit is computed by a $k$-input LUT implementing a Boolean function $f_j^{(\ell)}:\{0,1\}^k\rightarrow\{0,1\}$. Both the interconnect (the selection of which $k$ previous-layer bits to read) and the $2^k$ LUT entries are learned. Sparsity is inherent, as each output depends on only $k$ inputs.
  3. Continuous-Action Heads: Final-layer bits $b^{(L)}$ are partitioned into $d_\text{act}$ disjoint groups $G_1,\dots,G_{d_\text{act}}$. For each action dimension $d$:

z_d = \frac{1}{|G_d|}\sum_{i\in G_d} b^{(L)}_i - \frac{1}{2},\qquad l_d = \alpha_d z_d + \beta_d

where $\alpha_d = \exp(\alpha_{d,p})$ and $\beta_d$ are learned by gradient descent. Stochastic policies (e.g., in Soft Actor-Critic) apply a $\tanh$ to bound the output.

At inference, the forward pass consists entirely of threshold tests, LUT lookups, popcounting, and an affine mapping per action dimension.
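The inference path above can be sketched in plain Python. All dimensions, wire assignments, and truth tables below are toy placeholders (a trained DWC learns both the interconnect and the LUT contents); only the dataflow — thermometer encoding, LUT lookup, popcount, affine head — follows the description in this section.

```python
import random
from statistics import NormalDist

# Toy dimensions (the paper-scale config uses D_l = 1024, k = 6, B = 63).
D_IN, B, D_HID, K, D_ACT = 3, 7, 16, 4, 2
phi_inv = NormalDist().inv_cdf

# Stretched-Gaussian quantile thresholds, scaled so that the first, middle,
# and last thresholds land at -10, 0, and +10 as stated in Section 1.
s = 10.0 / abs(phi_inv(1.0 / (B + 1)))
TAU = [s * phi_inv(m / (B + 1)) for m in range(1, B + 1)]

def thermometer(x_hat):
    """Encode one normalized, clipped scalar into B threshold bits."""
    return [1 if x_hat >= t else 0 for t in TAU]

# Placeholder interconnect and LUT truth tables (learned in a real DWC).
rng = random.Random(0)
WIRES = [[rng.randrange(D_IN * B) for _ in range(K)] for _ in range(D_HID)]
LUTS = [[rng.randrange(2) for _ in range(2 ** K)] for _ in range(D_HID)]

def forward(obs, mu, sigma, alpha, beta):
    # 1. Normalize, clip, and thermometer-encode every observation.
    bits = []
    for x, m, sd in zip(obs, mu, sigma):
        x_hat = max(-10.0, min(10.0, (x - m) / sd))
        bits.extend(thermometer(x_hat))
    # 2. One hidden LUT layer: k wire bits form an address into a 2^k table.
    hidden = []
    for wires, lut in zip(WIRES, LUTS):
        addr = sum(bits[w] << i for i, w in enumerate(wires))
        hidden.append(lut[addr])
    # 3. Action heads: popcount per group, center at zero, affine map.
    group = D_HID // D_ACT
    actions = []
    for d in range(D_ACT):
        g = hidden[d * group:(d + 1) * group]
        z = sum(g) / len(g) - 0.5
        actions.append(alpha[d] * z + beta[d])
    return actions

a = forward([0.3, -1.2, 0.7], [0.0] * D_IN, [1.0] * D_IN, [2.0, 2.0], [0.0, 0.0])
```

Note that with $\alpha_d = 2$ and $\beta_d = 0$, each action is confined to $[-1, 1]$ because the centered popcount $z_d$ lies in $[-\tfrac12, \tfrac12]$.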

2. Differentiable Training Regime

Despite being fully discrete at inference, DWCs are trained by standard gradient-based reinforcement learning algorithms such as Soft Actor-Critic (SAC), DDPG, and PPO. There are two principal mechanisms for differentiability:

  • Surrogate Gradients for LUTs: The Extended Finite-Difference (EFD) estimator computes smoothed gradients for each LUT entry $T^{(\ell)}_j[u]$ by “perturbing” LUT addresses in Hamming-distance neighborhoods, yielding continuous proxies $\partial b^{(\ell)}_j / \partial T^{(\ell)}_j[u]$.
  • Straight-Through Interconnect Learning: The selection of input wires for each LUT is relaxed to continuous “attention” weights during training and binarized via a straight-through estimator in the forward pass.
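The straight-through mechanism for wire selection can be illustrated in miniature. The backward pass is hand-written here because no autograd framework is assumed; in actual training this logic would live inside the RL framework's automatic differentiation, and the function names below are illustrative only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def select_wire_forward(logits, candidate_bits):
    """Forward pass: hard one-hot selection of the highest-logit wire."""
    p = softmax(logits)
    best = p.index(max(p))
    hard = [1.0 if i == best else 0.0 for i in range(len(p))]
    y = sum(h * b for h, b in zip(hard, candidate_bits))
    return y, p

def select_wire_backward(grad_y, candidate_bits):
    """Straight-through backward: pretend the soft mixture sum_i p_i * b_i
    was used, so each relaxed weight p_i receives grad_y * b_i even though
    the forward pass used a hard argmax."""
    return [grad_y * b for b in candidate_bits]

logits = [0.2, 1.5, -0.3]
bits = [1.0, 0.0, 1.0]
y, p = select_wire_forward(logits, bits)   # hard pick: index 1, so y = 0.0
grads = select_wire_backward(1.0, bits)    # gradient reaches every candidate
```

The key point is visible in the last two lines: the forward output is fully discrete, yet every candidate wire still receives a gradient signal, so the relaxed selection weights can keep moving during training.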

All real-valued parameters (e.g., $\alpha_{d,p}$, $\beta_d$ in the action heads) are updated through gradients derived from standard RL objectives, such as the SAC policy loss:

J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\theta}\Big[\alpha\log\pi_\theta(a_t \mid s_t) - Q_\phi(s_t,a_t)\Big]

Thermometer encoding thresholds are fixed after computation of running statistics. Critic/value networks remain conventional floating-point architectures and are optimized using standard TD or KL-regularized losses.
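For concreteness, the SAC policy loss above reduces to a simple batch average; here is a toy numeric evaluation with made-up samples (the temperature, log-probabilities, and critic values are all illustrative, not from the paper):

```python
# Hypothetical samples standing in for a replay-buffer minibatch.
alpha = 0.2                       # entropy temperature (illustrative value)
log_probs = [-1.0, -0.5, -2.0]    # log pi_theta(a_t | s_t)
q_values = [10.0, 12.0, 9.0]      # critic estimates Q_phi(s_t, a_t)

# Monte Carlo estimate of J_pi(theta): mean of alpha * log pi - Q.
j_pi = sum(alpha * lp - q for lp, q in zip(log_probs, q_values)) / len(q_values)
```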

3. Empirical Evaluation and Performance

DWCs have been evaluated on five MuJoCo continuous-control benchmarks (Ant-v4, HalfCheetah-v4, Hopper-v4, Humanoid-v4, Walker2d-v4) using SAC. The following table summarizes median returns (with 25th–75th percentiles in brackets) over 10 seeds for DWCs, floating-point MLP baselines, and quantized (QAT) networks at $D_\ell=1024$, $k=6$, $B=63$:

| Environment | FP (256-node MLP) | QAT (2–3 bit net) | DWC (1024 LUTs, $k=6$, $B=63$) |
|-------------|-------------------|-------------------|--------------------------------|
| Ant         | 5598 [4253, 5802] | 4716 [3888, 4887] | 5677 [5516, 5906] |
| HalfCheetah | 11529 [10113, 11922] | 10465 [9622, 10956] | 7549 [7097, 7881] |
| Hopper      | 2797 [2062, 3349] | 1931 [1096, 3270] | 3120 [2777, 3386] |
| Humanoid    | 6186 [5956, 6650] | 5954 [5800, 6054] | 6141 [5819, 6605] |
| Walker2d    | 5044 [4697, 5194] | 4656 [4445, 5019] | 5025 [4510, 5196] |

DWCs match or exceed floating-point and quantized baselines on all tasks except HalfCheetah, where agent returns remain limited by network capacity. Ablation studies over $D_\ell$ indicate that 256–512 LUTs per layer suffice to saturate performance on most tasks, with HalfCheetah requiring larger architectures (up to $D_\ell=16{,}384$, $B=255$) to reach return regimes comparable to high-capacity MLPs.

4. Hardware Realization and Efficiency

DWCs compile naturally to efficient digital logic amenable to FPGA implementation. Synthesis on a Xilinx Artix-7 XC7A15T-1 at 100 MHz with two pipeline stages yields:

  • For $D_\ell=1024$: approximately 3,200 LUT6s, 3,700 flip-flops, 0 BRAM, 0 DSP; latency: 2–3 cycles; throughput: $10^8$ actions/s; power: ~0.22 W; energy: ~2 nJ/action.
  • For $D_\ell=256$: approximately 800 LUTs, 500 flip-flops, 1-cycle latency; energy: ~1 nJ/action.
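The figures above are mutually consistent, which a back-of-envelope check makes explicit: a fully pipelined design at 100 MHz completes one action per cycle, and 0.22 W spread over $10^8$ actions/s is about 2.2 nJ per action.

```python
# Figures quoted above for the D_l = 1024 Artix-7 design.
clock_hz = 100e6            # synthesis clock
actions_per_s = clock_hz    # fully pipelined: one action completes per cycle
power_w = 0.22              # reported on-chip power

energy_per_action_j = power_w / actions_per_s   # joules per action, ~2.2e-9
```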

For comparison, optimized 3-bit quantized neural networks require DSPs and BRAMs and exhibit much greater inference latency (hundreds to hundreds of thousands of cycles) and higher energy consumption (on the order of $10^{-4}$–$10^{-2}$ J/action).

At inference, a DWC is a pure logic-circuit cascade of threshold tests, LUT indexing, popcounts, and one small SRAM lookup per action, offering both high throughput and energy efficiency.

5. Robustness and Noise Sensitivity

When Gaussian noise with standard deviation $\sigma\in[0.1,0.5]$ is injected into normalized observations, DWCs exhibit robustness on par with or exceeding that of full-precision policies, and comparable to quantized networks. This suggests that the discrete, thresholded encoding may confer inherent resilience to moderate input perturbations, complementing the digital robustness advantages inherent to logic circuits.

6. Interpretability and Structural Sparsity

Due to the one-to-one correspondence between thermometer-encoded thresholds and binary wires, and the explicit sparse interconnect of each LUT, DWC connectivity is directly auditable. For every input dimension and threshold index, the number of downstream connections is countable. Empirical analyses reveal:

  • Task-relevant inputs (e.g., torso velocity in Humanoid) contribute considerably more connections than other dimensions.
  • Within each set of thresholds, most connectivity is concentrated near normalized zero (just above or below), underscoring the importance of sign changes in sensor signals.

This enables a transparent, “feature-importance” interpretability mechanism, in stark contrast to the opacity of dense neural network policies.
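The connection-counting audit described above can be sketched directly: given a first-layer interconnect, fan-out per thermometer bit (and hence per input dimension) is a simple tally. The wire lists below are hypothetical toy values, not from a trained controller.

```python
from collections import Counter

# Hypothetical first-layer interconnect: for each LUT, the indices of the
# thermometer bits it reads. Bit index i belongs to input dimension i // B
# at threshold index i % B.
B = 5                                          # toy thresholds per dimension
wires = [[0, 7, 12], [7, 8, 2], [12, 7, 3]]    # 3 LUTs, k = 3 wires each

# Fan-out of every thermometer bit across all LUT inputs.
fan_out = Counter(w for lut in wires for w in lut)

def connections_per_input(counter, n_dims, b):
    """Tally downstream connections by input dimension: a direct,
    countable feature-importance proxy."""
    per_dim = [0] * n_dims
    for bit, count in counter.items():
        per_dim[bit // b] += count
    return per_dim

per_dim = connections_per_input(fan_out, 3, B)   # one count per dimension
```

In this toy example, dimension 1 (bits 5–9) attracts the most wires, which is exactly the kind of per-input connection count the empirical analyses report.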

7. Summary and Significance

Differentiable Weightless Controllers combine neural-like gradient-based training with logic-circuit-like inference, offering ultra-low-latency, energy-efficient policies with sparse, interpretable structure. DWCs provide competitive continuous-control performance, especially in safety- and resource-critical domains, and offer a uniquely direct method for understanding the mapping from input sensors to policy actions (Kresse et al., 1 Dec 2025).
