NUDGE: Neurally Guided Logic Policies
- NUDGE is a reinforcement learning framework that integrates neural guidance with differentiable logic, enabling scalable symbolic reasoning for complex control tasks.
- The methodology employs neural side-channels to prioritize candidate logic rules and co-learns high-level symbolic plans with low-level control for improved performance.
- Empirical results demonstrate enhanced success rates and efficiency in robotics and Atari environments, validating its potential for real-world applications.
Neurally Guided Differentiable Logic Policies ("NUDGE") are a family of reinforcement learning techniques that integrate neural architectures and symbolic logic within an end-to-end differentiable framework. These methods seek to combine the interpretability, compositional abstraction, and formal constraint satisfaction of logic with the flexibility and scalability of neural networks, particularly in the context of complex control, planning, and symbolic reasoning environments. NUDGE approaches are characterized by the use of neural side-channels or pre-trained policies to guide the discovery or instantiation of logic rules, the enforcement of logic constraints via smooth, differentiable relaxations, and co-learning schemes that couple high-level discrete reasoning with low-level kinodynamic execution (Xiong et al., 2023, Delfosse et al., 2023).
1. Core Principles of NUDGE Architectures
NUDGE architectures instantiate reinforcement learning agents as (1) weighted logic programs or (2) neural modules whose outputs are explicitly shaped by logic constraints. The key principles include:
- Differentiable logic semantics: Logic programs, typically in first-order or temporal logic, are encoded with smooth, quantitative semantics (e.g., the STL robustness score ρ), which allows gradients to flow through logic objectives to upstream parameters.
- Neural guidance: Rather than searching the combinatorially large space of potential rules or plans from scratch, NUDGE leverages neural policies to prioritize or generate candidate logic rules or subgoals.
- End-to-end differentiability: Integration of neural and logic modules is achieved with tensorized representations of logic rules, fuzzy-AND/OR relaxations, and differentiable forward chaining or loss formulations (Xiong et al., 2023, Delfosse et al., 2023); a minimal sketch of these relaxations follows the list.
- Alternating or co-learning processes: High-level (logic/planning) and low-level (control/actuation) modules are trained in an alternating fashion, typically with policy-alignment mechanisms that ensure mutual feasibility.
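The fuzzy-AND/OR relaxations above can be made concrete with a minimal sketch; it assumes PyTorch and illustrative atom valuations rather than the cited papers' implementations, relaxing conjunction to a product t-norm and disjunction to a probabilistic sum:

```python
import torch

def fuzzy_and(vals: torch.Tensor) -> torch.Tensor:
    # Product t-norm: smooth stand-in for conjunction over a rule body.
    return torch.prod(vals, dim=-1)

def fuzzy_or(vals: torch.Tensor) -> torch.Tensor:
    # Probabilistic sum: smooth stand-in for disjunction over rules sharing a head.
    return 1.0 - torch.prod(1.0 - vals, dim=-1)

# Illustrative example: two rules support the same head atom; their bodies are
# soft ground-atom valuations in [0, 1].
body_1 = torch.tensor([0.9, 0.8], requires_grad=True)
body_2 = torch.tensor([0.4, 0.7], requires_grad=True)
head = fuzzy_or(torch.stack([fuzzy_and(body_1), fuzzy_and(body_2)]))
head.backward()           # gradients flow back to the atom valuations
print(head.item(), body_1.grad)
```

Stacking many such rule evaluations into tensors over all ground atoms is what makes forward chaining differentiable end to end.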
This paradigm aims to produce agents that are performant, verifiable, interpretable, and explainable—meeting requirements of safety-critical autonomous systems.
2. Differentiable Logic and Temporal Logic Machinery
NUDGE methods employ differentiable logical constraints formulated in Signal Temporal Logic (STL) or first-order logic. For STL, the satisfaction of a specification $\varphi$ over a trajectory $\tau$ is quantified by a robustness score $\rho(\varphi, \tau, t)$, defined recursively for atomic predicates $\mu \equiv f(\tau(t)) \ge 0$ and formulas $\varphi, \psi$ by:
- Boolean constructors:

$$\rho(\mu, \tau, t) = f(\tau(t)), \qquad \rho(\lnot\varphi, \tau, t) = -\rho(\varphi, \tau, t),$$
$$\rho(\varphi \land \psi, \tau, t) = \min\big(\rho(\varphi, \tau, t),\, \rho(\psi, \tau, t)\big), \qquad \rho(\varphi \lor \psi, \tau, t) = \max\big(\rho(\varphi, \tau, t),\, \rho(\psi, \tau, t)\big)$$

- Temporal constructors (over window $[a, b]$):

$$\rho(\mathbf{F}_{[a,b]}\,\varphi, \tau, t) = \max_{t' \in [t+a,\, t+b]} \rho(\varphi, \tau, t'), \qquad \rho(\mathbf{G}_{[a,b]}\,\varphi, \tau, t) = \min_{t' \in [t+a,\, t+b]} \rho(\varphi, \tau, t')$$
These operators are approximated with softmin/softmax for differentiability. For first-order logic, as in (Delfosse et al., 2023), differentiable forward chaining is realized via tensorized valuations of ground atoms and logic rules, facilitating gradient descent with actor-critic objectives.
The interplay of smooth logic constraints with trajectory-level loss enables direct optimization of symbolic task compliance.
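A minimal sketch of a smoothed robustness score, assuming PyTorch and an illustrative "eventually reach the goal" specification rather than the authors' implementation, shows how the softmax relaxation lets gradients flow from the logic objective into the trajectory:

```python
import torch

def soft_max(x: torch.Tensor, temp: float = 10.0) -> torch.Tensor:
    # Smooth approximation of max; larger temp means a tighter approximation.
    return torch.logsumexp(temp * x, dim=-1) / temp

def robustness_eventually_reach(traj: torch.Tensor, goal: torch.Tensor, radius: float) -> torch.Tensor:
    # Predicate robustness: radius - distance(state, goal), positive inside the goal region.
    margins = radius - torch.linalg.norm(traj - goal, dim=-1)
    # "Eventually" over the whole horizon = (soft) max over time steps.
    return soft_max(margins)

traj = torch.randn(50, 2, requires_grad=True)     # 50 time steps of 2-D positions
rho = robustness_eventually_reach(traj, torch.tensor([3.0, 3.0]), radius=0.5)
rho.backward()                                    # d(robustness)/d(trajectory)
```

Bounded "always" constraints are handled symmetrically with a softmin over the relevant window.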
3. Neural Guidance and Symbolic Abstraction
NUDGE leverages neural models to address the intractability of symbolic search or planning. This guidance appears in two primary forms:
- Rule discovery guided by neural teachers: Pre-trained neural RL policies (actor-critic or DQN) are used to score and filter candidate logic rules via beam search. Scoring computes the agreement (e.g., dot product) between the neural policy distribution and the distribution induced by each logic rule on representative states; the resulting rule sets are typically compressed to a few dozen high-utility candidates (Delfosse et al., 2023). A scoring sketch appears at the end of this section.
- Embedding physical context for symbolic planners: Neural perception modules (e.g., Mobile-SAM, Mobile-ViT pipelines) process raw input (images, sensor data) into spatial representations or symbolic scenes, which are then fed to logic/planning modules—maintaining the pipeline’s differentiability (Xiong et al., 2023).
The neural submodules thus serve two functions: (a) compact, efficient abstraction from high-dimensional observation spaces to logic-relevant features, and (b) guidance for rule or subgoal search and prioritization, accelerating discovery of effective symbolic policies.
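The rule-scoring step can be sketched as follows; the array shapes, action counts, and top-k cutoff are illustrative assumptions, not values from (Delfosse et al., 2023):

```python
import numpy as np

def score_rules(rule_action_dists: np.ndarray,      # (n_rules, n_states, n_actions)
                teacher_action_dists: np.ndarray,   # (n_states, n_actions)
                top_k: int = 30) -> np.ndarray:
    # Agreement of each candidate rule with the neural teacher: mean dot product over states.
    agreement = np.einsum('rsa,sa->r', rule_action_dists, teacher_action_dists)
    agreement /= rule_action_dists.shape[1]
    # Keep the indices of the top_k highest-agreement rules (a beam-search-style filter).
    return np.argsort(agreement)[::-1][:top_k]

rng = np.random.default_rng(0)
rules = rng.dirichlet(np.ones(5), size=(200, 64))   # 200 candidate rules, 64 states, 5 actions
teacher = rng.dirichlet(np.ones(5), size=64)
kept = score_rules(rules, teacher)
```

In practice the rule-induced distributions come from evaluating each candidate clause on the abstracted states, and the surviving candidates seed the differentiable logic policy.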
4. Co-Learning and Loss Alignment Schemes
NUDGE frameworks often employ alternating optimization between high-level logic planners and low-level controllers, aligning their policies to enforce both symbolic correctness and kinodynamic feasibility. Core components:
- Planner module:
Consumes symbolic specifications, maps, and start states; synthesizes long-horizon subgoal sequences via a Transformer-based decoder conditioned on spatial embeddings.
- Controller module:
Receives goal-conditioned observations (e.g., proprioception, LiDAR) and outputs actuation commands via standard RL (e.g., PPO).
- Alternating update loop:
For a fixed planner, controllers adapt to follow the new subgoal distributions. For a fixed controller, planners are trained jointly with logic losses and controller-approvability:

$$\mathcal{L}_{\text{planner}} = \mathcal{L}_{\text{TTR}} + \lambda\, \mathcal{L}_{\text{logic}},$$

with the first term favoring plans executable by the controller (short time-to-reach) and the second enforcing the specification via direct backpropagation through the smoothed STL robustness ρ (Xiong et al., 2023). A schematic co-learning loop is sketched at the end of this section.
This alignment prevents infeasible plans (which break robot dynamics constraints) and suboptimal wandering (where unconstrained RL agents fail to satisfy temporal logic).
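The alternating scheme can be summarized in a schematic loop; the planner and controller interfaces below are assumed for illustration and do not correspond to the papers' actual APIs:

```python
def co_learn(planner, controller, env, spec, n_rounds=10):
    """Schematic NUDGE-style co-learning loop (hypothetical interfaces)."""
    for _ in range(n_rounds):
        # Phase 1: planner frozen; the controller adapts to the current subgoal distribution.
        for _ in range(controller.updates_per_round):
            subgoals = planner.plan(env.reset(), spec)       # high-level symbolic plan
            controller.rl_update(env, subgoals)              # e.g. PPO on goal-conditioned rollouts
        # Phase 2: controller frozen; the planner is trained on the combined loss
        # L_planner = L_TTR (controller-approvability) + lambda * L_logic (smoothed STL robustness).
        for _ in range(planner.updates_per_round):
            plan = planner.plan(env.reset(), spec)
            loss = (planner.time_to_reach_loss(plan, controller)
                    + planner.logic_weight * planner.logic_loss(plan, spec))
            planner.gradient_step(loss)
    return planner, controller
```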
5. Interpretability, Explainability, and Policy Structure
A distinguishing feature of NUDGE is the production of policies in compact, explicitly interpretable logic form. For first-order logic NUDGE agents (Delfosse et al., 2023), the policy is defined by a few weighted rules:
```
0.57 : jump(X) :- closeby(O1,O2), type(O1,agent), type(O2,enemy).
0.29 : right_to_key(X) :- ¬has_key(X), on_right(O2,O1), type(O1,agent), type(O2,key).
...
```
Explainability extends to local, per-decision rationales: because forward chaining is differentiable, the Jacobian of action scores with respect to ground-atom valuations can be computed to attribute and explain the influence of particular state atoms on action choices. For example, in logic-based gridworld tasks, decisions such as "go right" can be causally linked to specific observations (e.g., lacking a key and the key lying to the right).
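A toy attribution sketch, with hypothetical atom names and a hand-written rule rather than the paper's learned policy, illustrates the mechanism:

```python
import torch

# Soft valuations of three ground atoms, e.g. has_key(agent), on_right(key, agent), type(O2, key).
atoms = torch.tensor([0.1, 0.9, 0.8], requires_grad=True)

def action_score_right(v: torch.Tensor) -> torch.Tensor:
    # Toy differentiable rule: "go right" is supported when the agent lacks the key
    # and the key lies to its right (negation as 1 - v, conjunction as a product).
    return (1.0 - v[0]) * v[1] * v[2]

score = action_score_right(atoms)
score.backward()
# atoms.grad attributes the decision to each atom; the negative gradient on has_key
# indicates that holding the key would suppress "go right".
print(atoms.grad)
```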
For the differentiable STL-based NUDGE (Xiong et al., 2023), the structure of plans and the satisfaction of temporally compositional tasks are transparent via logic robustness traces.
6. Empirical Performance and Quantitative Comparisons
NUDGE frameworks have demonstrated strong empirical performance:
- Robotics navigation (Xiong et al., 2023): NUDGE achieves >96% success rate and 20–30% reduction in time-to-reach versus unaligned baselines in complex long-horizon navigation with a Doggo quadruped (74D obs, 12D act) and TurtleBot3 (40D obs, 2D act). Real-world trials validate transferability: 5/6 to 6/6 successes per task versus ≤3/6 for unaligned.
- Sample complexity and scalability: NUDGE converges in an order of magnitude fewer transitions than RL with reward shaping or reward machines. Path generation remains sub-second even for cluttered maps, while mixed-integer STL solvers (e.g., STLPY) scale poorly.
- Symbolic environments and Atari (Delfosse et al., 2023): For relational tasks in GetOut, 3Fishes, and Loot, NUDGE matches or exceeds neural RL, especially when robustness is required across varied initial states and environment variants. On Atari OC-Asterix, NUDGE surpasses DQN (6,259 vs 125 average score).
A summary table of selected metrics:
| Environment | Baseline | NUDGE (success rate / performance) | Notes |
|---|---|---|---|
| Doggo Navigation | RL + reward shaping | 97.7% / 46 s | Aligned vs. 87% / 64 s unaligned |
| TurtleBot3 | Unaligned RL | 6/6 success | Real robot, rich temporal tasks |
| GetOut | Classic logic | 17.9 ± 2.9 | PPO and DQN underperform |
| 3Fishes-C | PPO | 3.26 | NUDGE robust to the variant; PPO crashes |
7. Integration with Related Neuro-Symbolic and Hardware-Friendly Approaches
Differentiable Weightless Controllers (DWC) (Kresse et al., 1 Dec 2025) and related architectures reflect convergence between NUDGE-style differentiable logic and purely propositional, hardware-optimized logic circuits. Both paradigms train Boolean layers or logic programs by gradient-based RL but differ in logic expressivity: DWCs operate at the level of Boolean LUTs and thermometer-encoded observations, while NUDGE supports first-order and temporal logics.
Potential synergies involve incorporating neural guidance for adaptable input discretization, interleaving neural and logic layers to boost function approximation, and extending LUT-based controllers to parametric predicate logic with learned templates. This unification could further the development of interpretable, verifiable, and efficient policies for continuous and relational environments (Kresse et al., 1 Dec 2025).
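For concreteness, thermometer encoding (as used on the DWC side) can be sketched as follows; the bin count and value range are illustrative:

```python
import numpy as np

def thermometer_encode(x: float, low: float, high: float, n_bits: int = 8) -> np.ndarray:
    # Evenly spaced thresholds over [low, high]; bit i is 1 iff x exceeds threshold i,
    # yielding a monotone bit pattern that Boolean LUT layers can consume directly.
    thresholds = np.linspace(low, high, n_bits)
    return (x > thresholds).astype(np.uint8)

print(thermometer_encode(0.3, -1.0, 1.0))   # [1 1 1 1 1 0 0 0]
```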
References
- (Xiong et al., 2023): "Co-learning Planning and Control Policies Constrained by Differentiable Logic Specifications"
- (Delfosse et al., 2023): "Interpretable and Explainable Logical Policies via Neurally Guided Symbolic Abstraction"
- (Kresse et al., 1 Dec 2025): "Differentiable Weightless Controllers: Learning Logic Circuits for Continuous Control"