Differentiable Weightless Controllers (DWC)
- DWCs are symbolic-differentiable controllers that use discrete logic circuits to map continuous observations to actions with high energy efficiency.
- They employ thermometer-encoded inputs and sparse boolean lookup-table layers to achieve interpretable, gradient-optimized policy learning.
- Empirical results on MuJoCo tasks demonstrate that DWCs match or exceed neural network policies while enabling low-latency, power-efficient FPGA deployments.
Differentiable Weightless Controllers (DWCs) are a symbolic-differentiable architecture for continuous control, designed to represent and learn policies as discrete logic circuits rather than traditional continuous neural networks. A DWC maps real-valued state observations to action outputs through thermometer-encoded input vectors, sparsely connected layers of boolean lookup tables (LUTs), and lightweight action heads. The system is fully amenable to end-to-end gradient-based optimization, while enabling direct compilation to FPGA-compatible circuits with minimal latency and exceptionally low energy per action. Empirical evidence demonstrates that DWCs can achieve returns competitive with both full-precision and quantized neural network policies on a range of MuJoCo continuous control tasks, with distinct properties of structural sparsity and interpretability (Kresse et al., 1 Dec 2025).
1. Input Encoding via Thermometer Codes
A DWC accepts an observation $s \in \mathbb{R}^d$ and first applies per-coordinate normalization and clipping, $\hat{s}_i = \operatorname{clip}\big((s_i - \mu_i)/\sigma_i,\, -c,\, c\big)$, where $\mu_i$ and $\sigma_i$ are running statistics. Each normalized value is discretized using a $T$-bit thermometer code with thresholds $\tau_t = \Phi^{-1}\big(t/(T+1)\big)$ for $t = 1, \dots, T$, where $\Phi^{-1}$ is the inverse CDF of the standard normal. The thermometer code is $b_{i,t} = \mathbb{1}[\hat{s}_i \geq \tau_t]$. Concatenating across all $d$ observation dimensions yields a binary vector $b \in \{0,1\}^{n_0}$, where $n_0 = dT$.
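The encoding above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name, the clipping range `c`, and the default `T` are assumptions.

```python
# Sketch of DWC thermometer encoding with Gaussian-quantile thresholds.
import numpy as np
from statistics import NormalDist

def thermometer_encode(s, mu, sigma, T=8, c=3.0):
    """Encode a real observation s (shape [d]) into a binary vector of
    length d*T. mu, sigma are running normalization statistics."""
    # Per-coordinate normalization and clipping.
    s_hat = np.clip((s - mu) / sigma, -c, c)
    # Thresholds tau_t = Phi^{-1}(t / (T+1)), inverse standard-normal CDF.
    taus = np.array([NormalDist().inv_cdf(t / (T + 1)) for t in range(1, T + 1)])
    # Thermometer bit b[i, t] = 1 iff s_hat[i] >= tau_t.
    bits = (s_hat[:, None] >= taus[None, :]).astype(np.uint8)
    return bits.reshape(-1)  # concatenate across dimensions -> length d*T

# Example: a 2-dimensional observation; each 8-bit slice is monotone
# (a run of 1s followed by 0s), the signature of a thermometer code.
b = thermometer_encode(np.array([0.5, -2.0]), mu=np.zeros(2), sigma=np.ones(2))
```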
2. Boolean Lookup-Table Layers
DWCs are composed of $L$ layers of boolean lookup tables. For each layer $\ell$, $n_\ell$ LUTs compute binary activations $a^{(\ell)} \in \{0,1\}^{n_\ell}$. Each LUT $j$ in layer $\ell$ selects $k$ inputs from the previous layer (learned indices $i_{j,1}, \dots, i_{j,k}$), forming an address vector $u_j = \big(a^{(\ell-1)}_{i_{j,1}}, \dots, a^{(\ell-1)}_{i_{j,k}}\big) \in \{0,1\}^k$. Each LUT stores a truth table $\theta_j \in \{0,1\}^{2^k}$ (relaxed to real values during training). The output is computed as $a^{(\ell)}_j = \theta_j[\operatorname{addr}(u_j)]$, where $\operatorname{addr}$ interprets $u_j$ as a $k$-bit integer. The sparsity and wiring pattern of the interconnects are learned; straight-through estimators keep the input selectors differentiable during training.
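A LUT layer's inference-time forward pass reduces to a gather and an integer address lookup. The sketch below assumes binarized truth tables and fixed learned indices; training-time relaxations are omitted, and all names are illustrative.

```python
# Minimal sketch of a boolean LUT layer at inference time.
import numpy as np

def lut_layer(a_prev, indices, tables):
    """a_prev:  binary activations of the previous layer, shape [n_prev].
    indices: learned input selectors, shape [n, k] (k inputs per LUT).
    tables:  binarized truth tables, shape [n, 2**k].
    Returns binary activations of shape [n]."""
    k = indices.shape[1]
    # Gather each LUT's k input bits and pack them into an integer address.
    bits = a_prev[indices]                 # [n, k]
    addr = bits @ (1 << np.arange(k))      # [n], values in [0, 2**k)
    # Each LUT outputs the stored truth-table entry at its address.
    return tables[np.arange(tables.shape[0]), addr]

# Example: 4 LUTs with k=2 inputs over an 8-bit input vector.
rng = np.random.default_rng(0)
a0 = rng.integers(0, 2, size=8)
idx = rng.integers(0, 8, size=(4, 2))
tbl = rng.integers(0, 2, size=(4, 4))
a1 = lut_layer(a0, idx, tbl)  # binary vector of length 4
```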
Surrogate gradients for discrete truth-table updates are computed using Extended Finite Difference (EFD), a method that perturbs input bits, measures the resulting change in the downstream loss, and aggregates the weighted finite differences over a Hamming ball of addresses around the observed LUT address.
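To make the idea concrete, here is a hedged sketch of a finite-difference surrogate for one LUT. The paper's EFD aggregates weighted differences over a Hamming ball; the specific weighting scheme below (geometric decay `gamma**dist`) and the `loss_fn` interface are assumptions for illustration only.

```python
# Hypothetical finite-difference surrogate gradient for one LUT's truth table.
import itertools
import numpy as np

def efd_surrogate(table, addr_bits, loss_fn, gamma=0.5):
    """table:     truth table of length 2**k.
    addr_bits: observed k-bit address (tuple of 0/1).
    loss_fn:   loss_fn(bit) -> downstream loss if this LUT outputs `bit`.
    Returns a surrogate gradient over all 2**k truth-table entries."""
    k = len(addr_bits)
    grad = np.zeros(len(table), dtype=float)
    # Finite difference of the loss w.r.t. this LUT's output bit.
    delta = loss_fn(1) - loss_fn(0)
    for bits in itertools.product([0, 1], repeat=k):
        a = sum(b << i for i, b in enumerate(bits))
        dist = sum(b != o for b, o in zip(bits, addr_bits))
        # Entries near the observed address (small Hamming distance)
        # receive the largest share of the finite difference.
        grad[a] = (gamma ** dist) * delta
    return grad

# Example: k=2 LUT, observed address bits (1, 0), toy downstream loss.
g = efd_surrogate(np.zeros(4), (1, 0), lambda bit: float(bit))
```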
3. Action Head and Output Parameterization
The final binary layer is divided into disjoint groups of bits $G_1, \dots, G_m$, one per action dimension. For action dimension $j$, the group's bits are aggregated by a normalized popcount, $p_j = \frac{1}{|G_j|} \sum_{i \in G_j} a_i$. A per-dimension affine transformation is then applied, $\tilde{a}_j = \alpha_j p_j + \beta_j$. For Soft Actor-Critic (SAC) policies, $\tilde{a}_j$ yields the action mean. All head parameters $\alpha_j, \beta_j$ are updated via standard gradients.
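The popcount-plus-affine head is small enough to sketch directly. Names and the `tanh` squashing convention (standard in SAC) are illustrative, not taken from the paper.

```python
# Sketch of the DWC action head: per-dimension popcount over a disjoint
# bit group, then a learned affine map.
import numpy as np

def action_head(a_final, groups, alpha, beta):
    """a_final: binary output of the last LUT layer.
    groups:  list of index arrays, one disjoint group per action dim.
    Returns the pre-squash action mean per dimension."""
    # Normalized popcount: fraction of set bits in each group.
    p = np.array([a_final[g].mean() for g in groups])
    return alpha * p + beta  # per-dimension affine transform

a_final = np.array([1, 0, 1, 1, 0, 0, 1, 0])
groups = [np.arange(0, 4), np.arange(4, 8)]
mean = action_head(a_final, groups, alpha=np.array([2.0, 2.0]),
                   beta=np.array([-1.0, -1.0]))
action = np.tanh(mean)  # SAC-style squashed action mean
```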
4. Training and Optimization Strategy
DWCs are trained end-to-end using the SAC algorithm. Policy parameters consist of all LUT truth tables ($\theta$), LUT input selectors, and output-head parameters ($\alpha, \beta$). The policy loss is the standard SAC objective, $\mathcal{L}_\pi = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\big[\alpha_{\mathrm{ent}} \log \pi(a \mid s) - \min_{i=1,2} Q_{\psi_i}(s, a)\big]$, where $\alpha_{\mathrm{ent}}$ is the entropy weight (automatically tuned), $Q_{\psi_1}, Q_{\psi_2}$ are critics, and $\mathcal{D}$ is the replay buffer. Backpropagation flows through the differentiable action head and popcount, through LUT truth tables via EFD surrogate gradients, and through LUT input selectors via straight-through estimation.
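The loss itself is the standard SAC policy objective; the sketch below evaluates it on placeholder tensors, abstracting away the DWC-specific surrogate gradients entirely.

```python
# Illustrative SAC policy-loss computation on a toy replay batch.
import numpy as np

def sac_policy_loss(log_pi, q1, q2, alpha):
    """log_pi: log-probabilities of sampled actions under the policy.
    q1, q2: twin-critic values for the same (state, action) pairs.
    alpha:  entropy temperature (auto-tuned in practice)."""
    # L_pi = E[ alpha * log pi(a|s) - min(Q1, Q2) ] over the batch.
    return np.mean(alpha * log_pi - np.minimum(q1, q2))

loss = sac_policy_loss(log_pi=np.array([-1.0, -2.0]),
                       q1=np.array([3.0, 5.0]),
                       q2=np.array([4.0, 4.5]),
                       alpha=0.2)
```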
5. FPGA Deployment and Efficiency Characteristics
At deployment, all floating-point operations are replaced by discrete, hardware-compatible operations. For Xilinx Artix-7 XC7A15T FPGAs at 100 MHz, a two-layer DWC with 6-input LUTs per layer entails:
- 3228 LUT6s and 1667 FFs used (no DSPs, no large BRAM, one small SRAM per action)
- Two-stage pipeline for 100 MHz timing
- Latency: 2–3 cycles ($20$–$30$ ns at 100 MHz)
- Throughput: on the order of $10^8$ actions/s (one action per cycle once the pipeline is full)
- Energy per action: orders of magnitude lower than quantized neural-network baselines
Comparatively, 2–3 bit quantized neural policies require thousands to hundreds of thousands of cycles per action, yielding throughput and energy per action that are orders of magnitude worse than the DWC's.
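The cycle arithmetic is easy to verify. The sketch below assumes one action per clock once the two-stage pipeline is full, which follows from the figures stated for the 100 MHz Artix-7 deployment.

```python
# Back-of-envelope check of the DWC timing figures.
clock_hz = 100e6                  # 100 MHz FPGA clock
cycle_ns = 1e9 / clock_hz         # 10 ns per cycle
latency_ns = (2 * cycle_ns, 3 * cycle_ns)  # 2-3 cycles -> 20-30 ns
throughput = clock_hz             # pipelined: ~1e8 actions/s
```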
6. Empirical Performance on Benchmark Tasks
DWCs were evaluated on Ant, HalfCheetah, Hopper, Humanoid, and Walker2d continuous control tasks in MuJoCo, each with 10 training seeds and evaluation over 1000 rollouts. Comparison was made to a 256-neuron full-precision MLP ("FP") and 2–3 bit quantized MLP ("Quant"):
| Environment | FP Return | Quant Return | DWC Return |
|---|---|---|---|
| Ant | 5598 [4253,5802] | 4717 [3888,4887] | 5677 [5516,5906] |
| HalfCheetah | 11529 [10113,11922] | 10465 [9622,10956] | 7549 [7097,7881] |
| Hopper | 2797 [2062,3349] | 1931 [1096,3270] | 3120 [2777,3386] |
| Humanoid | 6186 [5956,6650] | 5954 [5800,6054] | 6141 [5819,6605] |
| Walker2d | 5044 [4697,5194] | 4656 [4445,5019] | 5025 [4509,5196] |
Bracketed intervals indicate [25%, 75%] quantiles. For all tasks except HalfCheetah (which is capacity-limited), DWCs achieved returns equal or superior to both FP and quantized controllers. Capacity ablations show that for Ant, Hopper, Humanoid, and Walker2d, performance saturates at modest layer widths, while on HalfCheetah performance continues to improve with width; a sufficiently wide DWC with higher LUT arity matches full-precision returns. Further ablations on LUT arity $k$ and thermometer resolution $T$ indicate that most tasks do not require high arity or fine bit depth, with HalfCheetah again the exception.
7. Structural Sparsity and Interpretability
Sparsity and wiring patterns in DWCs provide interpretability absent from conventional neural policies. Analysis of first-layer LUT connectivity reveals:
- On the 376-dimensional Humanoid observation space, 275 dimensions receive zero first-layer connections, indicating they are ignored by the policy.
- The most-connected dimensions correspond to torso-velocity, highly correlated with reward.
- Per-threshold connectivity patterns across thermometer bits reveal a bimodal emphasis on thresholds just to the left and right of the normalized zero-point, highlighting informative value regions.
This direct inspection of wiring elucidates which input dimensions and thresholds influence decisions, addressing interpretability by exposing the sparse, symbolic logic structure inherent to the trained policy (Kresse et al., 1 Dec 2025).
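The connectivity analysis described above amounts to counting, per observation dimension, how many first-layer LUT inputs select one of that dimension's thermometer bits. The sketch below is illustrative; the index layout (dimension-major thermometer bits) is an assumption.

```python
# Sketch of first-layer connectivity analysis for interpretability.
import numpy as np

def input_connection_counts(indices, d, T):
    """indices: first-layer LUT input selectors into the d*T-bit
    thermometer vector, any shape. Returns per-dimension counts;
    zero entries correspond to ignored observation dimensions."""
    dims = indices.ravel() // T            # which observation dim owns each bit
    return np.bincount(dims, minlength=d)

# Example: d=3 dims, T=4 bits each; bits 0,3 -> dim 0; bit 9 -> dim 2.
counts = input_connection_counts(np.array([[0, 9], [3, 9]]), d=3, T=4)
# dim 1 receives zero connections, i.e. it is ignored by the policy.
```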