Action-conditioned RMS Q-Functions (ARQ)
- The paper introduces ARQ, a local, backpropagation-free TD learning method that uses action-conditioning and an RMS-based goodness function for per-cell Q-estimation.
- ARQ leverages independent module updates and rich state–action embeddings to generalize across both discrete and continuous action spaces.
- Empirical results on MinAtar and DMC benchmarks show ARQ matching or exceeding traditional methods while providing a biologically plausible approach to credit assignment.
Action-conditioned Root Mean Squared Q-Functions (ARQ) define a backpropagation-free, local temporal-difference learning paradigm for reinforcement learning (RL). ARQ builds upon the Forward-Forward (FF) algorithm's layerwise "goodness" concepts, introducing an action-conditioning mechanism and a root-mean-squared (RMS) function on hidden unit activations, enabling local Q-estimation that generalizes across both discrete and continuous action spaces. The framework eschews global error propagation, instead ascribing all credit assignment and gradient computation to independently updated modules, resulting in a biologically plausible and empirically competitive approach for value-based RL (Wu et al., 8 Oct 2025).
1. Motivation and Relation to Prior Methods
Standard deep Q-learning methods such as DQN approximate the state–action value as a scalar produced by end-to-end deep networks, typically trained with global backpropagation. While effective, this imposes biologically implausible weight symmetry and non-local credit assignment. The Forward-Forward (FF) algorithm avoids backward gradient flow, instead using two forward passes and a layerwise "goodness" measure, but its RL applicability has remained limited.
Artificial Dopamine (AD) [Guan et al., 2024] advances local RL by training stacks of "AD-cells," each producing Q-value estimates via a local, attention-style dot product, yet it requires fixing the hidden output size to match the action set, which constrains model capacity. By comparison, ARQ introduces action-conditioning at the input, enabling vector-valued hidden states with unconstrained dimensionality, and defines a local RMS-based goodness function instead of a dot product. Every local cell thereby represents rich state–action embeddings and supports backpropagation-free, per-cell value estimation (Wu et al., 8 Oct 2025).
2. Mathematical Formulation and Cell Architecture
Each ARQ module (cell) receives at time $t$ a composite input:
- State embedding
- Previous-layer activity
- Top-down activity
- Discrete or discretized action $a$, encoded as one-hot or binary
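Inputs are concatenated into a single input vector. The cell then computes:
- Learned linear projections of the concatenated input, with layer normalization on the pre-activations and ReLU nonlinearities
- An attention matrix
- Projected activations $h \in \mathbb{R}^{d}$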
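The ARQ goodness function is defined as the root mean square of the cell's projected activations $h \in \mathbb{R}^{d}$:
$$g(h) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} h_i^{2}}$$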
Layer normalization is applied to pre-activations, and a squashing nonlinearity is used to stabilize magnitudes; the RMS step itself is parameter-free.
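A minimal sketch of an RMS goodness computation, assuming the cell's activations are available as a plain list of floats. The function name `rms_goodness` and the example widths are illustrative and not taken from the paper.

```python
import math


def rms_goodness(activations):
    """Root mean square of a cell's hidden activations (no learned parameters)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    mean_square = sum(x * x for x in activations) / len(activations)
    return math.sqrt(mean_square)


# Units with the same magnitude give the same goodness at any width,
# whereas a plain sum of squares would grow with the number of units.
narrow = [0.5] * 8
wide = [0.5] * 512
print(rms_goodness(narrow), rms_goodness(wide))
```

Because the mean divides by the number of units, the output scale does not depend on the hidden width, which is one way to read the claim that the step is parameter-free and width-agnostic.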
3. Temporal-Difference Updates and Local Learning Rule
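Each ARQ module $l$ is regarded as a local Q-estimator, with its goodness output serving as $Q^{(l)}(s, a)$. For a transition $(s, a, r, s')$ and a target network $\bar{Q}$, the Bellman target is
$$y = r + \gamma \max_{a'} \bar{Q}(s', a')$$
The local TD error for cell $l$ is
$$\delta^{(l)} = y - Q^{(l)}(s, a)$$
Cells perform a local gradient step on the squared error loss,
$$\mathcal{L}^{(l)} = \left(\delta^{(l)}\right)^{2}$$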
with each module's parameters solely affecting its forward computation. No gradient signals or weight updates propagate between cells.
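The local update can be illustrated with a small sketch. It assumes a toy linear cell as a stand-in for the real ARQ cell, and the weights, features, reward, and learning rate are hypothetical values chosen for illustration. It is meant only to show that each cell forms its own TD error and updates its own weights, with nothing passed between cells.

```python
def bellman_target(reward, next_q_values, gamma):
    """r + gamma * max over a' of the target network's Q(s', a')."""
    return reward + gamma * max(next_q_values)


def cell_q(weights, features):
    """Toy linear stand-in for one cell's Q(s, a); not the ARQ cell itself."""
    return sum(w * x for w, x in zip(weights, features))


def local_step(weights, features, target, lr):
    """One gradient step on this cell's squared TD error, using only its own weights."""
    delta = target - cell_q(weights, features)
    # d(delta^2)/dw_i = -2 * delta * x_i, with the target held constant.
    new_weights = [w + lr * 2.0 * delta * x for w, x in zip(weights, features)]
    return new_weights, delta


# Hypothetical two-cell stack; each cell sees its own input features.
cells = [[0.0, 0.0], [0.5, -0.5]]
inputs = [[1.0, 2.0], [0.5, 1.0]]
target = bellman_target(reward=1.0, next_q_values=[0.2, 0.5, 0.1], gamma=0.9)

updated = []
errors_before = []
for weights, features in zip(cells, inputs):
    new_weights, delta = local_step(weights, features, target, lr=0.05)
    updated.append(new_weights)
    errors_before.append(delta)

errors_after = [target - cell_q(w, x) for w, x in zip(updated, inputs)]
print(errors_before, errors_after)
```

Each call to `local_step` reads and writes one cell's weights only, so the loop could run in any order or in parallel without changing the result.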
4. Action-Conditioning and Representation
ARQ implements action-conditioning by appending the action encoding to each module's input. For discrete actions $a \in \mathcal{A}$, a one-hot vector is used; for continuous actions, "bang–bang" discretization (Seyde et al., 2021) encodes each degree of freedom with two binary features. Because the action is an input to the network, every hidden unit and attention operation becomes sensitive to $a$, enabling richer per-action hidden representations, in contrast to conventional output-head approaches that yield a vector over $|\mathcal{A}|$ actions only at the network's output.
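The two encodings described above might look as follows. This is a sketch under assumptions: the helper names are hypothetical, and the layout of the two binary features per degree of freedom (a one-hot over minimum and maximum actuation) is one plausible reading rather than the paper's confirmed format.

```python
def one_hot(action_index, num_actions):
    """One-hot encoding of a discrete action."""
    if not 0 <= action_index < num_actions:
        raise ValueError("action_index out of range")
    return [1.0 if i == action_index else 0.0 for i in range(num_actions)]


def bang_bang_encode(use_max_per_dof):
    """Two binary features per degree of freedom: [min selected, max selected].

    Assumed layout; each entry of use_max_per_dof is True for maximum
    actuation and False for minimum actuation.
    """
    features = []
    for use_max in use_max_per_dof:
        features.extend([0.0, 1.0] if use_max else [1.0, 0.0])
    return features


def conditioned_input(state_embedding, action_encoding):
    """Append the action encoding to the state features fed to a cell."""
    return list(state_embedding) + list(action_encoding)


print(conditioned_input([0.1, 0.2], one_hot(2, 4)))
print(conditioned_input([0.1, 0.2], bang_bang_encode([True, False])))
```

With this layout, a system with `k` degrees of freedom contributes `2 * k` input features, regardless of how many joint action combinations exist.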
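The table below summarizes the cell computation:
| Input | Linear/Attention Operations | Goodness Output |
|---|---|---|
| Concatenation of state embedding, previous-layer activity, top-down activity, and action encoding | Linear projections, attention, LayerNorm, ReLU | RMS of the projected activations, used as the cell's Q-estimate |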
5. Empirical Performance and Benchmarks
Benchmarks include MinAtar (5 games) and DeepMind Control Suite (5 environments), following protocols from Guan et al. (2024). Typical architectures are:
- MinAtar: Three layers, each built from ARQ cells with skip and top-down connections, LayerNorm, and ReLU activations.
- DMC: Three layers. Hyperparameters include the learning rate, a batch size of 512, the discount factor $\gamma$, an $\epsilon$-greedy schedule annealed across the first 10% of total training steps, a replay buffer, periodic target network updates, and a delayed start to learning.
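An exploration schedule that anneals over the first 10% of training can be sketched as below. The linear shape, the start and end values, and the step count are assumptions made for illustration; only the 10% annealing window comes from the text.

```python
def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.05, anneal_fraction=0.10):
    """Linearly anneal epsilon over the first anneal_fraction of training, then hold."""
    anneal_steps = max(1, round(total_steps * anneal_fraction))
    progress = min(1.0, step / anneal_steps)
    return eps_start + progress * (eps_end - eps_start)


# Hypothetical run length, chosen only to make the schedule easy to read.
total_steps = 1000
for step in (0, 50, 100, 500):
    print(step, epsilon_at(step, total_steps))
```

After the annealing window ends, the function keeps returning `eps_end`, so the remaining 90% of training uses a fixed exploration rate.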
Results indicate that ARQ outperforms Artificial Dopamine on all MinAtar games and most DMC tasks, and matches or exceeds backpropagation-trained standards (DQN, SAC, TD-MPC²) on many environments (Wu et al., 8 Oct 2025).
6. Theoretical Properties and Algorithmic Guarantees
ARQ inherits the convergence guarantees of tabular Q-learning provided sufficient exploration, decaying learning rates, and Markovian environment dynamics. In the presence of nonlinear function approximation within each local cell, no formal global convergence proof is established, but empirical stability is observed when employing target networks and LayerNorm. All parameter updates are local and rely solely on per-cell TD errors; global gradient flow is neither used nor required.
7. Strengths, Limitations, and Prospective Extensions
Strengths:
- Entirely local, backpropagation-free updates per cell.
- Action-conditioning at input enables expressive per-action representations.
- RMS-based goodness enables expansion to large hidden dimensions $d$ without parameter scaling issues.
- Demonstrated strong empirical results on both discrete and continuous control benchmarks.
Limitations:
- Continuous action spaces require discretization ("bang–bang" or coarse binning).
- Computational cost scales linearly with the number of actions $|\mathcal{A}|$ due to per-action forward passes unless optimized.
- Formal convergence in the general nonlinear regime remains unproven.
- Experiments are limited to low-dimensional state spaces; application to raw images is unspecified.
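The per-action cost noted in the limitations can be made concrete with a toy example. The cell below is a simplified stand-in (one linear layer, ReLU, then RMS) with hypothetical random weights and dimensions; it is not the ARQ architecture. It only shows why greedy action selection needs one forward pass per candidate action when the action is part of the input.

```python
import math
import random


def toy_cell_q(weights, state, action_one_hot):
    """Toy action-conditioned cell: ReLU(W x) followed by RMS over hidden units."""
    x = list(state) + list(action_one_hot)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return math.sqrt(sum(h * h for h in hidden) / len(hidden))


def greedy_action(weights, state, num_actions):
    """Evaluate the cell once per candidate action and return the argmax."""
    q_values = []
    for a in range(num_actions):  # one forward pass per action
        encoding = [1.0 if i == a else 0.0 for i in range(num_actions)]
        q_values.append(toy_cell_q(weights, state, encoding))
    best = max(range(num_actions), key=lambda a: q_values[a])
    return best, q_values


rng = random.Random(0)
state_dim, num_actions, hidden_dim = 4, 3, 16
weights = [[rng.uniform(-1.0, 1.0) for _ in range(state_dim + num_actions)]
           for _ in range(hidden_dim)]
state = [0.3, -0.1, 0.8, 0.5]

best, q_values = greedy_action(weights, state, num_actions)
print(best, q_values)
```

In practice the passes for different actions are independent, so they could be batched; the loop here keeps the linear dependence on the number of actions visible.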
Prospective Extensions:
- Incorporation of FF-style negative-pass (contrastive) signals for RL.
- Application to convolutional or transformer-based networks.
- Investigation of critics for continuous control without discretization, such as actor–critic hybrids.
- Theoretical study of local-cell training as a form of multi-agent Q-learning.
A plausible implication is that ARQ's cell-local, action-sensitive value estimation could inform biologically plausible and scalable alternatives to backpropagation in RL, especially when output-dimensionality constraints or implementation of global credit assignment are prohibitive (Wu et al., 8 Oct 2025).