Differentiable Weightless Controllers (DWC)
- DWCs are symbolic-differentiable controllers that use discrete logic circuits to map continuous observations to actions with high energy efficiency.
- They employ thermometer-encoded inputs and sparse boolean lookup-table layers to achieve interpretable, gradient-optimized policy learning.
- Empirical results on MuJoCo tasks demonstrate that DWCs match or exceed neural network policies while enabling low-latency, power-efficient FPGA deployments.
Differentiable Weightless Controllers (DWCs) are a symbolic-differentiable architecture for continuous control, designed to represent and learn policies as discrete logic circuits rather than traditional continuous neural networks. A DWC maps real-valued state observations to action outputs through thermometer-encoded input vectors, sparsely connected layers of boolean lookup tables (LUTs), and lightweight action heads. The system is fully amenable to end-to-end gradient-based optimization, while enabling direct compilation to FPGA-compatible circuits with minimal latency and exceptionally low energy per action. Empirical evidence demonstrates that DWCs can achieve returns competitive with both full-precision and quantized neural network policies on a range of MuJoCo continuous control tasks, with distinct properties of structural sparsity and interpretability (Kresse et al., 1 Dec 2025).
1. Input Encoding via Thermometer Codes
A DWC accepts an observation $s \in \mathbb{R}^d$ and first applies per-coordinate normalization and clipping, $\hat{s}_i = \operatorname{clip}\big((s_i - \mu_i)/\sigma_i,\, -c,\, c\big)$, where $\mu_i$ and $\sigma_i$ are running statistics. Each normalized value is discretized using a $T$-bit thermometer code with thresholds $\tau_t = \Phi^{-1}\big(t/(T+1)\big)$ for $t = 1, \dots, T$, where $\Phi^{-1}$ is the inverse CDF of the standard normal. The thermometer code is $b_{i,t} = \mathbb{1}[\hat{s}_i \geq \tau_t]$. Concatenating across all $d$ observation dimensions yields a binary vector $b \in \{0,1\}^{n_0}$, where $n_0 = dT$.
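The encoding above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name, the clipping range `c`, and the default `T` are assumptions.

```python
# Sketch of DWC thermometer encoding with Gaussian-quantile thresholds.
import numpy as np
from statistics import NormalDist

def thermometer_encode(s, mu, sigma, T=8, c=3.0):
    """Encode a real observation s (shape [d]) into a binary vector of
    length d*T. mu, sigma are running normalization statistics."""
    # Per-coordinate normalization and clipping.
    s_hat = np.clip((s - mu) / sigma, -c, c)
    # Thresholds tau_t = Phi^{-1}(t / (T+1)), inverse standard-normal CDF.
    taus = np.array([NormalDist().inv_cdf(t / (T + 1)) for t in range(1, T + 1)])
    # Thermometer bit b[i, t] = 1 iff s_hat[i] >= tau_t.
    bits = (s_hat[:, None] >= taus[None, :]).astype(np.uint8)
    return bits.reshape(-1)  # concatenate across dimensions -> length d*T

# Example: a 2-dimensional observation; each 8-bit slice is monotone
# (a run of 1s followed by 0s), the signature of a thermometer code.
b = thermometer_encode(np.array([0.5, -2.0]), mu=np.zeros(2), sigma=np.ones(2))
```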
2. Boolean Lookup-Table Layers
DWCs are composed of $L$ layers of boolean lookup tables. For each layer $\ell$, $n_\ell$ LUTs compute binary activations $a^{(\ell)} \in \{0,1\}^{n_\ell}$. Each LUT $j$ in layer $\ell$ selects $k$ inputs from the previous layer (learned indices $i_{j,1}, \dots, i_{j,k}$), forming an address vector $u_j = \big(a^{(\ell-1)}_{i_{j,1}}, \dots, a^{(\ell-1)}_{i_{j,k}}\big) \in \{0,1\}^k$. Each LUT stores a truth table $\theta_j \in \{0,1\}^{2^k}$ (relaxed to real values during training). The output is computed as $a^{(\ell)}_j = \theta_j[\operatorname{addr}(u_j)]$, where $\operatorname{addr}$ interprets $u_j$ as a $k$-bit integer. The sparsity and wiring pattern of the interconnects are learned; straight-through estimators keep the input selectors differentiable during training.
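A LUT layer's inference-time forward pass reduces to a gather and an integer address lookup. The sketch below assumes binarized truth tables and fixed learned indices; training-time relaxations are omitted, and all names are illustrative.

```python
# Minimal sketch of a boolean LUT layer at inference time.
import numpy as np

def lut_layer(a_prev, indices, tables):
    """a_prev:  binary activations of the previous layer, shape [n_prev].
    indices: learned input selectors, shape [n, k] (k inputs per LUT).
    tables:  binarized truth tables, shape [n, 2**k].
    Returns binary activations of shape [n]."""
    k = indices.shape[1]
    # Gather each LUT's k input bits and pack them into an integer address.
    bits = a_prev[indices]                 # [n, k]
    addr = bits @ (1 << np.arange(k))      # [n], values in [0, 2**k)
    # Each LUT outputs the stored truth-table entry at its address.
    return tables[np.arange(tables.shape[0]), addr]

# Example: 4 LUTs with k=2 inputs over an 8-bit input vector.
rng = np.random.default_rng(0)
a0 = rng.integers(0, 2, size=8)
idx = rng.integers(0, 8, size=(4, 2))
tbl = rng.integers(0, 2, size=(4, 4))
a1 = lut_layer(a0, idx, tbl)  # binary vector of length 4
```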
Surrogate gradients for discrete truth-table updates are computed using Extended Finite Difference (EFD), a method that perturbs input bits, measures the resulting change in the downstream loss, and aggregates the weighted finite differences over a Hamming ball of addresses around the observed LUT address.
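To make the idea concrete, here is a hedged sketch of a finite-difference surrogate for one LUT. The paper's EFD aggregates weighted differences over a Hamming ball; the specific weighting scheme below (geometric decay `gamma**dist`) and the `loss_fn` interface are assumptions for illustration only.

```python
# Hypothetical finite-difference surrogate gradient for one LUT's truth table.
import itertools
import numpy as np

def efd_surrogate(table, addr_bits, loss_fn, gamma=0.5):
    """table:     truth table of length 2**k.
    addr_bits: observed k-bit address (tuple of 0/1).
    loss_fn:   loss_fn(bit) -> downstream loss if this LUT outputs `bit`.
    Returns a surrogate gradient over all 2**k truth-table entries."""
    k = len(addr_bits)
    grad = np.zeros(len(table), dtype=float)
    # Finite difference of the loss w.r.t. this LUT's output bit.
    delta = loss_fn(1) - loss_fn(0)
    for bits in itertools.product([0, 1], repeat=k):
        a = sum(b << i for i, b in enumerate(bits))
        dist = sum(b != o for b, o in zip(bits, addr_bits))
        # Entries near the observed address (small Hamming distance)
        # receive the largest share of the finite difference.
        grad[a] = (gamma ** dist) * delta
    return grad

# Example: k=2 LUT, observed address bits (1, 0), toy downstream loss.
g = efd_surrogate(np.zeros(4), (1, 0), lambda bit: float(bit))
```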
3. Action Head and Output Parameterization
The final binary layer is divided into disjoint groups of bits $G_1, \dots, G_m$, one per action dimension. For action dimension $j$, the group's bits are aggregated by a normalized popcount, $p_j = \frac{1}{|G_j|} \sum_{i \in G_j} a_i$. A per-dimension affine transformation is then applied, $\tilde{a}_j = \alpha_j p_j + \beta_j$. For Soft Actor-Critic (SAC) policies, $\tilde{a}_j$ yields the action mean. All head parameters $\alpha_j, \beta_j$ are updated via standard gradients.
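The popcount-plus-affine head is small enough to sketch directly. Names and the `tanh` squashing convention (standard in SAC) are illustrative, not taken from the paper.

```python
# Sketch of the DWC action head: per-dimension popcount over a disjoint
# bit group, then a learned affine map.
import numpy as np

def action_head(a_final, groups, alpha, beta):
    """a_final: binary output of the last LUT layer.
    groups:  list of index arrays, one disjoint group per action dim.
    Returns the pre-squash action mean per dimension."""
    # Normalized popcount: fraction of set bits in each group.
    p = np.array([a_final[g].mean() for g in groups])
    return alpha * p + beta  # per-dimension affine transform

a_final = np.array([1, 0, 1, 1, 0, 0, 1, 0])
groups = [np.arange(0, 4), np.arange(4, 8)]
mean = action_head(a_final, groups, alpha=np.array([2.0, 2.0]),
                   beta=np.array([-1.0, -1.0]))
action = np.tanh(mean)  # SAC-style squashed action mean
```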
4. Training and Optimization Strategy
DWCs are trained end-to-end using the SAC algorithm. Policy parameters consist of all LUT truth tables ($\theta$), LUT input selectors, and output-head parameters ($\alpha, \beta$). The policy loss is the standard SAC objective, $\mathcal{L}_\pi = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\big[\alpha_{\mathrm{ent}} \log \pi(a \mid s) - \min_{i=1,2} Q_{\psi_i}(s, a)\big]$, where $\alpha_{\mathrm{ent}}$ is the entropy weight (automatically tuned), $Q_{\psi_1}, Q_{\psi_2}$ are critics, and $\mathcal{D}$ is the replay buffer. Backpropagation flows through the differentiable action head and popcount, through LUT truth tables via EFD surrogate gradients, and through LUT input selectors via straight-through estimation.
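The loss itself is the standard SAC policy objective; the sketch below evaluates it on placeholder tensors, abstracting away the DWC-specific surrogate gradients entirely.

```python
# Illustrative SAC policy-loss computation on a toy replay batch.
import numpy as np

def sac_policy_loss(log_pi, q1, q2, alpha):
    """log_pi: log-probabilities of sampled actions under the policy.
    q1, q2: twin-critic values for the same (state, action) pairs.
    alpha:  entropy temperature (auto-tuned in practice)."""
    # L_pi = E[ alpha * log pi(a|s) - min(Q1, Q2) ] over the batch.
    return np.mean(alpha * log_pi - np.minimum(q1, q2))

loss = sac_policy_loss(log_pi=np.array([-1.0, -2.0]),
                       q1=np.array([3.0, 5.0]),
                       q2=np.array([4.0, 4.5]),
                       alpha=0.2)
```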
5. FPGA Deployment and Efficiency Characteristics
At deployment, all floating-point operations are replaced by discrete, hardware-compatible operations. For Xilinx Artix-7 XC7A15T FPGAs at 100 MHz, a two-layer DWC with 6-input LUTs per layer entails:
- 3228 LUT6s and 1667 FFs used (no DSPs, no large BRAM, one small SRAM per action)
- Two-stage pipeline for 100 MHz timing
- Latency: 2–3 cycles ($20$–$30$ ns at 100 MHz)
- Throughput: on the order of $10^8$ actions/s (one action per cycle once the pipeline is full)
- Energy per action: orders of magnitude lower than quantized neural-network baselines
Comparatively, 2–3 bit quantized neural policies require thousands to hundreds of thousands of cycles per action, yielding throughput and energy per action that are orders of magnitude worse than the DWC's.
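The cycle arithmetic is easy to verify. The sketch below assumes one action per clock once the two-stage pipeline is full, which follows from the figures stated for the 100 MHz Artix-7 deployment.

```python
# Back-of-envelope check of the DWC timing figures.
clock_hz = 100e6                  # 100 MHz FPGA clock
cycle_ns = 1e9 / clock_hz         # 10 ns per cycle
latency_ns = (2 * cycle_ns, 3 * cycle_ns)  # 2-3 cycles -> 20-30 ns
throughput = clock_hz             # pipelined: ~1e8 actions/s
```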
6. Empirical Performance on Benchmark Tasks
DWCs were evaluated on Ant, HalfCheetah, Hopper, Humanoid, and Walker2d continuous control tasks in MuJoCo, each with 10 training seeds and evaluation over 1000 rollouts. Comparison was made to a 256-neuron full-precision MLP ("FP") and 2–3 bit quantized MLP ("Quant"):
| Environment | FP Return | Quant Return | DWC Return |
|---|---|---|---|
| Ant | 5598 [4253,5802] | 4717 [3888,4887] | 5677 [5516,5906] |
| HalfCheetah | 11529 [10113,11922] | 10465 [9622,10956] | 7549 [7097,7881] |
| Hopper | 2797 [2062,3349] | 1931 [1096,3270] | 3120 [2777,3386] |
| Humanoid | 6186 [5956,6650] | 5954 [5800,6054] | 6141 [5819,6605] |
| Walker2d | 5044 [4697,5194] | 4656 [4445,5019] | 5025 [4509,5196] |
Bracketed intervals indicate [25%, 75%] quantiles. For all tasks except HalfCheetah (which is capacity-limited), DWCs achieved returns equal or superior to both FP and quantized controllers. Capacity ablations show that for Ant, Hopper, Humanoid, and Walker2d, performance saturates at modest layer widths, while on HalfCheetah performance continues to improve with width; a sufficiently wide DWC with higher LUT arity matches full-precision returns. Further ablations on LUT arity $k$ and thermometer resolution $T$ indicate that most tasks do not require high arity or fine bit depth, with HalfCheetah again the exception.
7. Structural Sparsity and Interpretability
Sparsity and wiring patterns in DWCs provide interpretability absent from conventional neural policies. Analysis of first-layer LUT connectivity reveals:
- On the 376-dimensional Humanoid observation space, 275 dimensions receive zero first-layer connections, indicating they are ignored by the policy.
- The most-connected dimensions correspond to torso-velocity, highly correlated with reward.
- Per-threshold connectivity patterns across thermometer bits reveal a bimodal emphasis on thresholds just to the left and right of the normalized zero-point, highlighting informative value regions.
This direct inspection of wiring elucidates which input dimensions and thresholds influence decisions, addressing interpretability by exposing the sparse, symbolic logic structure inherent to the trained policy (Kresse et al., 1 Dec 2025).
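The connectivity analysis described above amounts to counting, per observation dimension, how many first-layer LUT inputs select one of that dimension's thermometer bits. The sketch below is illustrative; the index layout (dimension-major thermometer bits) is an assumption.

```python
# Sketch of first-layer connectivity analysis for interpretability.
import numpy as np

def input_connection_counts(indices, d, T):
    """indices: first-layer LUT input selectors into the d*T-bit
    thermometer vector, any shape. Returns per-dimension counts;
    zero entries correspond to ignored observation dimensions."""
    dims = indices.ravel() // T            # which observation dim owns each bit
    return np.bincount(dims, minlength=d)

# Example: d=3 dims, T=4 bits each; bits 0,3 -> dim 0; bit 9 -> dim 2.
counts = input_connection_counts(np.array([[0, 9], [3, 9]]), d=3, T=4)
# dim 1 receives zero connections, i.e. it is ignored by the policy.
```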