
Differentiable Weightless Controllers (DWC)

Updated 6 February 2026
  • DWCs are symbolic-differentiable controllers that use discrete logic circuits to map continuous observations to actions, enabling highly energy-efficient execution.
  • They employ thermometer-encoded inputs and sparse boolean lookup-table layers to achieve interpretable, gradient-optimized policy learning.
  • Empirical results on MuJoCo tasks demonstrate that DWCs match or exceed neural network policies while enabling low-latency, power-efficient FPGA deployments.

Differentiable Weightless Controllers (DWCs) are a symbolic-differentiable architecture for continuous control, designed to represent and learn policies as discrete logic circuits rather than traditional continuous neural networks. A DWC maps real-valued state observations to action outputs through thermometer-encoded input vectors, sparsely connected layers of boolean lookup tables (LUTs), and lightweight action heads. The system is fully amenable to end-to-end gradient-based optimization, while enabling direct compilation to FPGA-compatible circuits with minimal latency and exceptionally low energy per action. Empirical evidence demonstrates that DWCs can achieve returns competitive with both full-precision and quantized neural network policies on a range of MuJoCo continuous control tasks, with distinct properties of structural sparsity and interpretability (Kresse et al., 1 Dec 2025).

1. Input Encoding via Thermometer Codes

A DWC accepts an observation $x \in \mathbb{R}^{d_{\rm in}}$ and first applies per-coordinate normalization and clipping:

$$\hat x_j = \mathrm{clip}\left(\frac{x_j - \mu_j}{\sigma_j},\ -10,\ 10\right), \quad j = 1, \ldots, d_{\rm in}$$

where $\mu_j, \sigma_j$ are running statistics. Each normalized value $\hat{x}_j$ is discretized using a $B$-bit thermometer code with Gaussian-spaced thresholds:

$$\tau_m = s\,\Phi^{-1}(q_m), \quad q_m = \tfrac{m}{B},\ m = 1, \ldots, B, \quad s = \frac{10}{\left|\Phi^{-1}(\tfrac{1}{B})\right|}$$

where $\Phi^{-1}$ is the inverse CDF of the standard normal. The thermometer code is

$$E_j(\hat{x}_j) = \left[\mathbf{1}\{\hat{x}_j \ge \tau_1\},\ \mathbf{1}\{\hat{x}_j \ge \tau_2\},\ \ldots,\ \mathbf{1}\{\hat{x}_j \ge \tau_B\}\right] \in \{0,1\}^B$$

Concatenating across all dimensions yields a binary vector $b^{(0)} \in \{0,1\}^{D_0}$, where $D_0 = B\,d_{\rm in}$.
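As an illustration, the encoding can be sketched in NumPy. The quantile grid $q_m = m/(B+1)$ used here is a hypothetical choice that keeps all $B$ thresholds finite; the paper's exact grid may differ.

```python
import numpy as np
from statistics import NormalDist

def thermometer_encode(x, B=8, clip=10.0):
    """Thermometer-encode a normalized observation vector.

    Thresholds are Gaussian-spaced, tau_m = s * Phi^{-1}(q_m), scaled so the
    outermost thresholds sit at the clipping boundary.  Note: q_m = m/(B+1)
    is an assumption made here so that every threshold is finite.
    """
    ppf = NormalDist().inv_cdf
    q = [(m + 1) / (B + 1) for m in range(B)]
    s = clip / abs(ppf(q[0]))
    tau = np.array([s * ppf(qm) for qm in q])          # shape (B,)
    x = np.clip(np.asarray(x, dtype=float), -clip, clip)
    # bit m fires iff x_j >= tau_m; concatenate codes across dimensions
    bits = (x[:, None] >= tau[None, :]).astype(np.uint8)
    return bits.reshape(-1)                            # shape (d_in * B,)

code = thermometer_encode([0.0, -2.5, 2.5], B=4)       # 12-bit binary vector
```

Each coordinate contributes a monotone run of ones whose length grows with its value, so nearby observations share most of their bits.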

2. Boolean Lookup-Table Layers

DWCs are composed of $L$ layers of boolean lookup tables. For each layer $\ell$, $D_\ell$ LUTs compute binary activations $b^{(\ell)} \in \{0,1\}^{D_\ell}$. Each LUT $i$ in layer $\ell+1$ selects $k$ inputs from the previous layer via learned indices $(c_{i,1}, \ldots, c_{i,k})$, forming an address vector

$$a^{(\ell+1)}_i = \left(b^{(\ell)}_{c_{i,1}}, \ldots, b^{(\ell)}_{c_{i,k}}\right) \in \{0,1\}^k$$

Each LUT $i$ stores a truth table $T_i \in \{0,1\}^{2^k}$, and its output is computed as

$$u = \mathrm{addr}\left(a^{(\ell+1)}_i\right), \quad b^{(\ell+1)}_i = T_i[u]$$

The sparsity and wiring pattern of the interconnects is learned; straight-through estimators provide differentiability during training.
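The inference-time forward pass of one LUT layer can be sketched as follows (training-time relaxations and straight-through estimation are omitted; the toy wiring and tables are illustrative only):

```python
import numpy as np

def lut_layer(bits_prev, conn, tables):
    """Evaluate one boolean LUT layer at inference time.

    bits_prev : (D_prev,) uint8   previous-layer activations
    conn      : (D, k) int        learned input indices per LUT
    tables    : (D, 2**k) uint8   learned truth tables
    """
    k = conn.shape[1]
    addr_bits = bits_prev[conn]               # (D, k) gather the k fan-in bits
    weights = 2 ** np.arange(k - 1, -1, -1)   # read bits as a binary address
    u = addr_bits @ weights                   # (D,) address into each table
    return tables[np.arange(len(tables)), u]

# toy layer: 3 LUTs of arity k=2 over a 4-bit input
b0 = np.array([1, 0, 1, 1], dtype=np.uint8)
conn = np.array([[0, 1], [2, 3], [1, 3]])
tables = np.array([[0, 0, 0, 1],   # AND
                   [0, 1, 1, 0],   # XOR
                   [0, 1, 1, 1]],  # OR
                  dtype=np.uint8)
b1 = lut_layer(b0, conn, tables)   # -> [0, 0, 1]
```

Because each LUT reads only $k$ bits and performs a single table lookup, the whole layer maps directly onto hardware LUT primitives.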

Surrogate gradients for the discrete truth-table entries are computed with Extended Finite Differences (EFD), which perturbs input bits and measures the resulting change in the downstream loss, aggregating weighted finite differences over the Hamming ball around the current address:

$$\widehat{\frac{\partial L}{\partial T_i[u]}} = \sum_{\Delta \in \{0,1\}^k \setminus \{0\}} w(\|\Delta\|_1)\,\bigl[L(b[a \oplus \Delta]) - L(b[a])\bigr]$$
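A minimal sketch of this Hamming-ball aggregation for a single LUT follows. The loss callback and the decaying weighting `w(h) = 0.5**h` are hypothetical stand-ins; the paper's exact weighting scheme may differ.

```python
from itertools import product

def efd_surrogate(loss_of_addr, a, k, w=lambda h: 0.5 ** h):
    """Extended-finite-difference surrogate gradient for one LUT.

    loss_of_addr(addr_bits) returns the downstream loss when the LUT is
    read at that k-bit address.  Weighted loss differences are aggregated
    over all nonzero perturbations Delta in the Hamming ball around the
    current address a (a tuple of k bits).
    """
    base = loss_of_addr(a)
    grad = 0.0
    for delta in product((0, 1), repeat=k):
        h = sum(delta)                   # Hamming weight of the perturbation
        if h == 0:
            continue
        flipped = tuple(ai ^ di for ai, di in zip(a, delta))
        grad += w(h) * (loss_of_addr(flipped) - base)
    return grad

# toy loss: popcount of the address; gradient estimate at address (0, 0)
g = efd_surrogate(lambda addr: float(sum(addr)), (0, 0), k=2)
```

The cost grows as $2^k$ per LUT, which is why small arities ($k \le 6$) keep training tractable.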

3. Action Head and Output Parameterization

The final binary layer $b^{(L)} \in \{0,1\}^{D_L}$ is divided into $d_{\rm act}$ disjoint groups $G_1, \ldots, G_{d_{\rm act}}$, one per action dimension. For action dimension $d$:

$$s_d = \frac{1}{|G_d|} \sum_{i \in G_d} b^{(L)}_i \in [0,1], \quad z_d = s_d - \tfrac{1}{2} \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]$$

A per-dimension affine transformation is then applied:

$$\ell_d = \alpha_d\,z_d + \beta_d, \quad \alpha_d = \exp(\alpha_{d,p}),\ \beta_d \in \mathbb{R}$$

For Soft Actor-Critic (SAC) policies, $\tanh(\ell_d)$ yields the action mean. All $\alpha_d, \beta_d$ parameters are updated via standard gradients.
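The head is simple enough to state directly in NumPy; the group layout and parameter values below are illustrative:

```python
import numpy as np

def action_head(b_final, groups, alpha_p, beta):
    """Map final binary activations to continuous action means (SAC).

    b_final : (D_L,) uint8          last LUT-layer output
    groups  : list of index arrays, one per action dimension
    alpha_p : (d_act,) float        log-scale parameters (alpha = exp(alpha_p))
    beta    : (d_act,) float        per-dimension offsets
    """
    s = np.array([b_final[g].mean() for g in groups])  # popcount mean in [0, 1]
    z = s - 0.5                                        # centred to [-1/2, 1/2]
    ell = np.exp(alpha_p) * z + beta                   # per-dimension affine
    return np.tanh(ell)                                # squashed action mean

b = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=np.uint8)
groups = [np.arange(0, 4), np.arange(4, 8)]            # two action dimensions
a = action_head(b, groups, alpha_p=np.zeros(2), beta=np.zeros(2))
```

At deployment the group mean reduces to an integer popcount followed by a small lookup, so no floating-point unit is needed.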

4. Training and Optimization Strategy

DWCs are trained end-to-end with the SAC algorithm. Policy parameters comprise all LUT truth tables $\{T_i\}$, LUT input selectors $\{c_{i,j}\}$, and output-head parameters $\{\alpha_{d,p}, \beta_d\}$. The policy loss is

$$J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi_\theta}\left[\alpha\,\log \pi_\theta(a \mid s) - Q_\phi(s,a)\right]$$

where $\alpha$ is the (automatically tuned) entropy weight, $Q_\phi$ are the critics, and $\mathcal{D}$ is the replay buffer. Backpropagation flows through the differentiable action head and its popcount averaging, through the LUT truth tables via EFD surrogate gradients, and through the LUT input selectors via straight-through estimation.
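In practice the expectation is estimated over a replay minibatch; a minimal sketch of that estimate (standard SAC, not specific to DWCs):

```python
import numpy as np

def sac_policy_loss(log_probs, q_values, alpha):
    """Monte-Carlo estimate of the SAC policy objective
    J_pi = E[ alpha * log pi(a|s) - Q(s, a) ]
    over a minibatch of actions sampled from the current policy.

    log_probs, q_values : (batch,) arrays; alpha : entropy weight.
    """
    return float(np.mean(alpha * log_probs - q_values))

loss = sac_policy_loss(np.array([-1.0, -2.0]), np.array([3.0, 5.0]), alpha=0.2)
```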

5. FPGA Deployment and Efficiency Characteristics

At deployment, all floating-point operations are replaced by discrete, hardware-compatible operations. For a Xilinx Artix-7 XC7A15T FPGA at 100 MHz, a two-layer DWC with $D_\ell = 1024$ 6-input LUTs per layer entails:

  • 3228 LUT6s and 1667 FFs used (no DSPs, no large BRAM, one small SRAM per action)
  • Two-stage pipeline for 100 MHz timing
  • Latency: 2–3 cycles (20–30 ns at the 100 MHz clock)
  • Throughput: $10^8$ actions/s
  • Energy per action: $\approx 2.2 \times 10^{-9}$ J
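The figures above are mutually consistent, as a back-of-envelope check shows (the implied average power is our derivation, not a reported number):

```python
# Sanity check of the deployment figures at a 100 MHz clock.
clock_hz = 100e6
cycle_s = 1.0 / clock_hz                    # 10 ns per cycle

latency_s = (2 * cycle_s, 3 * cycle_s)      # 2-3 pipeline cycles -> 20-30 ns
throughput = clock_hz                       # one action per cycle when pipelined
energy_per_action = 2.2e-9                  # J, as reported
power_w = energy_per_action * throughput    # implied average power: ~0.22 W
```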

Comparatively, 2–3 bit quantized neural policies require thousands to hundreds of thousands of cycles per action, with throughput as low as $4.1 \times 10^3$ actions/s and energy per action from $6.5 \times 10^{-8}$ J up to $8 \times 10^{-5}$ J.

6. Empirical Performance on Benchmark Tasks

DWCs were evaluated on Ant, HalfCheetah, Hopper, Humanoid, and Walker2d continuous control tasks in MuJoCo, each with 10 training seeds and evaluation over 1000 rollouts. Comparison was made to a 256-neuron full-precision MLP ("FP") and 2–3 bit quantized MLP ("Quant"):

| Environment | FP Return | Quant Return | DWC Return |
|---|---|---|---|
| Ant | 5598 [4253, 5802] | 4717 [3888, 4887] | 5677 [5516, 5906] |
| HalfCheetah | 11529 [10113, 11922] | 10465 [9622, 10956] | 7549 [7097, 7881] |
| Hopper | 2797 [2062, 3349] | 1931 [1096, 3270] | 3120 [2777, 3386] |
| Humanoid | 6186 [5956, 6650] | 5954 [5800, 6054] | 6141 [5819, 6605] |
| Walker2d | 5044 [4697, 5194] | 4656 [4445, 5019] | 5025 [4509, 5196] |

Bracketed intervals indicate [25%, 75%] quantiles. For all tasks except HalfCheetah (which is capacity-limited), DWCs achieved returns equal or superior to both FP and quantized controllers. Capacity ablations show that for Ant, Hopper, Humanoid, and Walker2d, performance saturates at $D_\ell = 256$. On HalfCheetah, performance increases up to $D_\ell = 4096$; a DWC with $D_\ell = 16{,}384$ and $B = 255$ matches full-precision returns. Further ablations on LUT arity ($k$) and bit resolution ($B$) indicate that most tasks do not require high arity or fine bit-depth, HalfCheetah excepted.

7. Structural Sparsity and Interpretability

Sparsity and wiring patterns in DWCs provide interpretability absent from conventional neural policies. Analysis of first-layer LUT connectivity reveals:

  • On the $d_{\rm in} = 376$-dimensional Humanoid task, 275 dimensions receive zero connections, indicating they are ignored.
  • The most-connected dimensions correspond to torso-velocity, highly correlated with reward.
  • Per-threshold connectivity patterns across thermometer bits reveal a bimodal emphasis on thresholds just to the left and right of the normalized zero-point, highlighting informative value regions.
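This kind of connectivity audit is straightforward to script. A sketch, assuming the first layer's learned selectors are available as an index array (the toy wiring below is illustrative):

```python
import numpy as np

def input_usage(conn, d_in, B):
    """Count first-layer LUT connections per observation dimension.

    conn : (D_1, k) int   learned indices into the thermometer code b^(0);
                          bit index c belongs to observation dimension c // B.
    """
    dims = conn.ravel() // B
    counts = np.bincount(dims, minlength=d_in)   # connections per input dim
    ignored = np.flatnonzero(counts == 0)        # dims with zero fan-out
    return counts, ignored

# toy wiring: 3 LUTs of arity 2, d_in = 4 observations, B = 3 thermometer bits
conn = np.array([[0, 4], [5, 1], [2, 0]])
counts, ignored = input_usage(conn, d_in=4, B=3)  # dims 2 and 3 unused
```

The same index arithmetic (`c % B`) recovers which thermometer threshold each connection taps, giving the per-threshold histograms described above.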

This direct inspection of wiring elucidates which input dimensions and thresholds influence decisions, addressing interpretability by exposing the sparse, symbolic logic structure inherent to the trained policy (Kresse et al., 1 Dec 2025).
