Differentiable Weightless Controllers (DWCs)
- Differentiable Weightless Controllers (DWCs) are policy architectures that replace neural network multiply–accumulate layers with sparse Boolean lookup tables for interpretable and efficient continuous control.
- They employ surrogate gradients and straight-through estimators to enable end-to-end gradient-based reinforcement learning, achieving competitive results on MuJoCo benchmarks.
- Their FPGA-compatible design delivers ultra-low latency and nanojoule-level energy consumption, making them ideal for safety- and resource-critical applications.
Differentiable Weightless Controllers (DWCs) are a symbolic-differentiable policy architecture for continuous control that replaces the conventional multiply–accumulate layers of neural-network policies with sparse, discrete logic circuits. A DWC directly maps real-valued observations to actions using a cascade of data-adaptive thermometer encodings, sparsely connected Boolean lookup-table (LUT) layers, and lightweight continuous-action heads. While fully discrete at inference, DWCs are amenable to end-to-end, gradient-based reinforcement learning due to the adoption of surrogate gradients and straight-through estimators during training. DWCs can be synthesized into FPGA-compatible architectures with single- or few-cycle latency and operate at nanojoule-level energy cost per action. Their sparse, structured connectivity yields directly interpretable input-output relationships in policy networks, making them suited to safety- and resource-critical control applications (Kresse et al., 1 Dec 2025).
1. Architecture of Differentiable Weightless Controllers
A Differentiable Weightless Controller comprises three principal components:
- Thermometer Encoding: Each real-valued observation $x_i$ is first normalized and clipped using maintained running statistics $(\mu_i, \sigma_i)$: $\hat{x}_i = \mathrm{clip}\big((x_i - \mu_i)/\sigma_i,\, -c,\, c\big)$. An odd number $T$ of thresholds $\tau_1 < \dots < \tau_T$ is placed at stretched-Gaussian quantiles of the clipped range. The thermometer code for dimension $i$ is then the bit vector $t_{i,j} = \mathbb{1}[\hat{x}_i \ge \tau_j]$, $j = 1, \dots, T$. The concatenation over all $d$ input dimensions produces a binary vector $\mathbf{b}^{(0)} \in \{0,1\}^{dT}$.
- Boolean LUT Layers: Each of the $L$ hidden layers contains $N$ binary outputs. Every output bit is computed by a $k$-input LUT implementing a learned Boolean function $f : \{0,1\}^k \to \{0,1\}$. Both the interconnect (the selection of which $k$ previous-layer bits to read) and the $2^k$ LUT entries are learned. Sparsity is inherent, as each output depends on only $k$ inputs.
- Continuous-Action Heads: Final-layer bits are partitioned into disjoint groups $G_1, \dots, G_m$, one per action dimension. For each action dimension $i$:
$$a_i = \alpha_i \cdot \frac{\mathrm{popcount}(G_i)}{|G_i|} + \beta_i,$$
where the scale $\alpha_i$ and offset $\beta_i$ are learned by gradient descent. Stochastic policies (e.g., in Soft Actor-Critic) apply a $\tanh$ squashing to bound the output.
At inference, the forward pass consists entirely of threshold tests, LUT lookups, popcounting, and an affine mapping per action dimension.
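The full inference path above (threshold tests, LUT lookups, popcounting, affine map) can be sketched in a few lines of NumPy; all sizes, tables, and wiring below are hypothetical random stand-ins for learned parameters, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d observations, T thresholds, one hidden layer of
# N k-input LUTs, m action dimensions.
d, T, N, k, m = 4, 7, 32, 6, 2

# --- Thermometer encoding: threshold tests against fixed cut-points.
mu, sigma = np.zeros(d), np.ones(d)       # frozen running statistics
taus = np.linspace(-2.0, 2.0, T)          # stand-in for stretched-Gaussian quantiles

def thermometer(x):
    x_hat = np.clip((x - mu) / sigma, -3.0, 3.0)
    return (x_hat[:, None] >= taus[None, :]).astype(np.uint8).ravel()  # (d*T,)

# --- One Boolean LUT layer: each output bit reads k previous-layer bits.
wires = rng.integers(0, d * T, size=(N, k))                      # sparse interconnect
tables = rng.integers(0, 2, size=(N, 2 ** k)).astype(np.uint8)   # LUT entries

def lut_layer(bits):
    addr = (bits[wires] * (2 ** np.arange(k))).sum(axis=1)  # k bits -> LUT address
    return tables[np.arange(N), addr]                       # (N,)

# --- Continuous-action heads: popcount per group, then an affine map.
groups = np.split(np.arange(N), m)        # disjoint groups of final-layer bits
alpha, beta = np.ones(m), np.zeros(m)     # learned scale/offset per action dim

def heads(bits):
    frac = np.array([bits[g].mean() for g in groups])  # popcount / group size
    return alpha * frac + beta

obs = rng.standard_normal(d)
action = heads(lut_layer(thermometer(obs)))
assert action.shape == (m,)
```

Note that the entire forward pass uses only comparisons, table indexing, and one small affine map, mirroring the logic-circuit structure of the hardware realization.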
2. Differentiable Training Regime
Despite being fully discrete at inference, DWCs are trained by standard gradient-based reinforcement learning algorithms such as Soft Actor-Critic (SAC), DDPG, and PPO. There are two principal mechanisms for differentiability:
- Surrogate Gradients for LUTs: The Extended Finite-Difference (EFD) estimator computes smoothed gradients for each LUT entry by “perturbing” LUT addresses within Hamming-distance neighborhoods, yielding continuous gradient proxies for the discrete table lookups.
- Straight-Through Interconnect Learning: The selection of input wires for each LUT is relaxed to continuous “attention” weights during training and binarized via a straight-through estimator in the forward pass.
All real-valued parameters (e.g., the scale and offset parameters of the action heads) are updated through gradients derived from standard RL objectives, such as the SAC policy loss $J_\pi = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\!\left[\alpha \log \pi(a \mid s) - Q(s, a)\right]$.
Thermometer encoding thresholds are fixed after computation of running statistics. Critic/value networks remain conventional floating-point architectures and are optimized using standard TD or KL-regularized losses.
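The straight-through interconnect relaxation can be illustrated with a minimal NumPy sketch: the forward pass routes a single hard-selected wire, while the gradient with respect to the selection logits is computed as if the soft attention mixture had been used. The logit values and candidate bits below are arbitrary illustrations, not learned quantities:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical: one LUT input pin choosing among 8 candidate wires.
logits = rng.standard_normal(8)                   # continuous "attention" weights
bits = rng.integers(0, 2, size=8).astype(float)   # previous-layer bit values

# Forward pass: hard one-hot selection (the discrete interconnect).
hard = np.zeros_like(logits)
hard[np.argmax(logits)] = 1.0
y_hard = hard @ bits            # the bit actually routed at inference

# Backward pass (straight-through): gradients flow as if the soft mixture
# p @ bits had been used, i.e. through the softmax Jacobian.
p = softmax(logits)
grad_soft = bits * p - p * (p @ bits)   # d(p @ bits) / d(logits)

assert y_hard in (0.0, 1.0)
```

The softmax gradient components sum to zero, so the estimator only reallocates selection mass between candidate wires rather than changing its total.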
3. Empirical Evaluation and Performance
DWCs have been evaluated on five MuJoCo continuous-control benchmarks (Ant-v4, HalfCheetah-v4, Hopper-v4, Humanoid-v4, Walker2d-v4) using SAC. The following table summarizes median returns over 10 seeds for DWCs, floating-point MLP baselines, and quantized (QAT) networks:
| Environment | FP (256-node MLP) | QAT (2–3 bit net) | DWC (1024 LUTs) |
|---|---|---|---|
| Ant | 5598 | 4716 | 5677 |
| HalfCheetah | 11529 | 10465 | 7549 |
| Hopper | 2797 | 1931 | 3120 |
| Humanoid | 6186 | 5954 | 6141 |
| Walker2d | 5044 | 4656 | 5025 |
DWCs match or exceed floating-point and quantized baselines on all tasks except HalfCheetah, where returns remain limited by network capacity. Ablation studies over layer width indicate that 256–512 LUTs per layer suffice to saturate performance on most tasks, with HalfCheetah requiring larger architectures to approach the return regime of high-capacity MLPs.
4. Hardware Realization and Efficiency
DWCs compile naturally to efficient digital logic amenable to FPGA implementation. Synthesis on a Xilinx Artix-7 XC7A15T-1 at 100 MHz with two pipeline stages yields:
- Largest synthesized configuration: approximately 3,200 LUT6s, 3,700 flip-flops, 0 BRAM, 0 DSP; latency: 2–3 cycles; throughput: one action per cycle when pipelined (on the order of 10⁸ actions/s at 100 MHz); power: ~0.22 W; energy: ~2 nJ/action.
- Smaller configuration: approximately 800 LUTs, 500 flip-flops; 1-cycle latency; energy: ~1 nJ/action.
For comparison, optimized 3-bit quantized neural networks require DSPs and BRAMs, exhibit far greater inference latency (hundreds to hundreds of thousands of cycles), and consume orders of magnitude more energy per action than the DWC's nanojoule regime.
At inference, a DWC is a pure logic-circuit cascade of threshold tests, LUT indexing, popcounts, and one small SRAM lookup per action, offering both high throughput and energy efficiency.
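A quick sanity check on the reported figures: a fully pipelined design issuing one action per clock at 100 MHz while drawing ~0.22 W works out to roughly 2 nJ per action, consistent with the synthesis results above.

```python
# Energy per action = power / throughput, assuming one action per clock cycle.
clock_hz = 100e6            # 100 MHz clock
power_w = 0.22              # reported power draw
energy_per_action_j = power_w / clock_hz
assert abs(energy_per_action_j - 2.2e-9) < 1e-12   # ~2 nJ/action
```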
5. Robustness and Noise Sensitivity
Injecting Gaussian noise of increasing standard deviation into normalized observations, DWCs exhibit robustness on par with or exceeding that of full-precision policies, and comparable to quantized networks. This suggests that the discrete, thresholded encoding may confer inherent resilience to moderate input perturbations, complementing the digital robustness advantages inherent to logic circuits.
6. Interpretability and Structural Sparsity
Due to the one-to-one correspondence between thermometer-encoded thresholds and binary wires, and the explicit sparse interconnect of each LUT, DWC connectivity is directly auditable. For every input dimension and threshold index, the number of downstream connections is countable. Empirical analyses reveal:
- Task-relevant inputs (e.g., torso velocity in Humanoid) contribute considerably more connections than other dimensions.
- Within each input's set of thresholds, most connectivity is concentrated at thresholds near normalized zero (just above or below), underscoring the importance of sign changes in sensor signals.
This enables a transparent, “feature-importance” interpretability mechanism, in stark contrast to the opacity of dense neural network policies.
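The connectivity audit described above reduces to counting fan-out per thermometer bit. A minimal sketch, using a hypothetical random interconnect as a stand-in for a trained one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: d input dims, T thresholds each, N first-layer LUTs
# with k input pins apiece.
d, T, N, k = 4, 7, 32, 6
wires = rng.integers(0, d * T, size=(N, k))   # learned sparse interconnect

# Fan-out of each thermometer bit, arranged as (input dim, threshold index).
fanout = np.bincount(wires.ravel(), minlength=d * T).reshape(d, T)

# Per-input importance: total downstream connections per observation dimension.
importance = fanout.sum(axis=1)
assert fanout.sum() == N * k   # every LUT pin is counted exactly once
```

Sorting `importance` directly ranks observation dimensions by how much downstream logic reads them, which is the feature-importance view discussed above.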
7. Summary and Significance
Differentiable Weightless Controllers combine neural-like gradient-based training with logic-circuit-like inference, offering ultra-low-latency, energy-efficient policies with sparse, interpretable structure. DWCs provide competitive continuous-control performance, especially in safety- and resource-critical domains, and offer a uniquely direct method for understanding the mapping from input sensors to policy actions (Kresse et al., 1 Dec 2025).