Recurrent Weighted Average (RWA)

Updated 13 June 2026
  • Recurrent Weighted Average (RWA) is an RNN architecture that computes a dynamic running weighted average over past inputs, integrating attention into its recurrence.
  • It updates cumulative numerator and denominator in constant time (O(1)) per step, enabling efficient handling of long-range dependencies without costly retrospective attention.
  • Empirical evaluations show RWA converges faster and uses fewer parameters than LSTM on sequential tasks, though it faces challenges when tasks require revising past contributions.

The Recurrent Weighted Average (RWA) is a recurrent neural network (RNN) architecture that integrates an attention-like running weighted average directly into the recurrence, permitting efficient $O(1)$-per-step processing with access to the full historical context. Unlike conventional RNNs such as LSTM or GRU, which propagate information strictly one step at a time, RWA makes each new hidden state a direct, dynamically reweighted summary of all previously observed inputs, which helps it capture long-range dependencies in sequential data (Ostmeyer et al., 2017, Maginnis et al., 2017).

1. Motivation and Conceptual Distinction

Traditional RNNs, including LSTM and GRU, update their hidden state exclusively from the immediately preceding state and the current input. Information from earlier in the sequence must traverse a chain of potentially hundreds or thousands of steps, which makes training prone to vanishing and exploding gradients. Attention mechanisms, introduced to address this, compute a weighted sum over all intermediate states but are typically applied post hoc and entail $O(T^2)$ computational cost for sequences of length $T$.

The central insight of the RWA architecture is to "bake in" the attention mechanism within the recurrence relation. Each processing step maintains a running weighted average across historical feature encodings, with weights given by dynamically computed attention scores. The essential innovation is that this computation can be maintained as two running sums, yielding $O(1)$ update cost per step. This matches the efficiency of classical RNN units while offering a representational capacity reminiscent of global attention (Ostmeyer et al., 2017, Maginnis et al., 2017).

2. Mathematical Formulation

Given an input sequence $(x_1, x_2, \dots, x_T)$ and an initial state $h_0 = f(s_0)$ with $f(\cdot) = \tanh(\cdot)$ and learned $s_0$, the recurrent update equations at timestep $t \geq 1$ are as follows:

$$
\begin{aligned}
u_t &= W_u x_t + b_u \\
g_t &= W_g [x_t; h_{t-1}] + b_g \\
a_t &= W_a [x_t; h_{t-1}] \\
z_t &= u_t \circ \tanh(g_t) \\
n_t &= n_{t-1} + z_t \circ e^{a_t} \\
d_t &= d_{t-1} + e^{a_t} \\
h_t &= f(n_t / d_t)
\end{aligned}
$$

Here $[\cdot\,;\cdot]$ denotes feature concatenation, $\circ$ indicates element-wise multiplication, and all attention is performed through the exponentiated attention logits $e^{a_t}$. This formulation computes $h_t$ as a nonlinear function of a weighted average over all prior signed encodings $z_t$, with weights determined by locally computed attention scores.
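The update equations above can be sketched in plain Python to make the data flow concrete. This is a minimal illustration, not the authors' implementation: the function names, the dict-based parameter layout, and the initial values $n_0 = 0$ and $d_0 = 0$ are assumptions made for this example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rwa_step(x, h_prev, n_prev, d_prev, p):
    """One RWA update; `p` is a dict of weights (hypothetical layout)."""
    xh = x + h_prev  # list concatenation, i.e. [x_t; h_{t-1}]
    u = [m + b for m, b in zip(matvec(p["W_u"], x), p["b_u"])]
    g = [m + b for m, b in zip(matvec(p["W_g"], xh), p["b_g"])]
    a = matvec(p["W_a"], xh)  # attention logits (no bias term)
    z = [u_i * math.tanh(g_i) for u_i, g_i in zip(u, g)]  # signed encoding
    w = [math.exp(a_i) for a_i in a]  # unnormalized attention weights
    n = [n_i + z_i * w_i for n_i, z_i, w_i in zip(n_prev, z, w)]
    d = [d_i + w_i for d_i, w_i in zip(d_prev, w)]
    h = [math.tanh(n_i / d_i) for n_i, d_i in zip(n, d)]
    return h, n, d

def rwa_forward(xs, s0, p):
    """Run RWA over a sequence of input vectors and return every h_t."""
    h = [math.tanh(s) for s in s0]  # h_0 = f(s_0)
    n = [0.0] * len(s0)  # assumed n_0 = 0
    d = [0.0] * len(s0)  # assumed d_0 = 0
    hs = []
    for x in xs:
        h, n, d = rwa_step(x, h, n, d, p)
        hs.append(h)
    return hs
```

A quick sanity check of the sketch: with `W_a` set to zero every attention weight equals 1, so each `h_t` reduces to `tanh` of the plain mean of the signed encodings seen so far.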

All architectural parameters are standard dense layer weights and biases: $W_u$, $b_u$, $W_g$, $b_g$, and $W_a$, with the hidden dimension typically set to 250 units (Ostmeyer et al., 2017, Maginnis et al., 2017).

3. Algorithmic Properties and Computational Complexity

Unlike explicit attention mechanisms that require retaining and reaccessing all past activations ($O(T)$ per step, $O(T^2)$ over a full sequence), the RWA model's numerator and denominator can be updated incrementally:

$$
n_t = n_{t-1} + z_t \circ e^{a_t}, \qquad d_t = d_{t-1} + e^{a_t}
$$

Thus, per-step runtime and memory do not grow with the number of processed timesteps; they are constant and determined solely by the hidden dimension. This makes RWA well suited to very long or streaming sequences (Ostmeyer et al., 2017, Maginnis et al., 2017).
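Unrolling the two running sums makes the weighted-average reading explicit. The expansion below is a worked illustration that assumes $n_0 = 0$ and $d_0 = 0$ (the initial values of the sums are an assumption here), with the division taken element-wise:

```latex
h_t = f\!\left(\frac{n_t}{d_t}\right)
    = f\!\left(\frac{\sum_{i=1}^{t} z_i \circ e^{a_i}}{\sum_{i=1}^{t} e^{a_i}}\right)
```

Each step adds one term to the numerator and one to the denominator, so the full sum over $i = 1, \dots, t$ never has to be recomputed.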

To keep the computation numerically stable for large logits or long sequences (where $e^{a_t}$ may overflow), it is recommended to maintain a running maximum of $a_t$ and rescale the numerator and denominator in lockstep (see Appendix B of Ostmeyer et al., 2017).
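The running-maximum idea can be sketched for a single hidden unit as follows. This is one possible realization under stated assumptions, not necessarily the exact scheme in the cited appendix: the function name `stable_update` and the choice to store both sums pre-multiplied by `exp(-a_max)` are illustrative.

```python
import math

def stable_update(n_prev, d_prev, a_max_prev, z, a):
    """Update the scaled running sums for one hidden unit.

    The stored values are n_t * exp(-a_max) and d_t * exp(-a_max), so every
    exponent passed to math.exp is <= 0 and cannot overflow. The ratio n / d
    is unaffected because both sums share the same scale factor.
    """
    a_max = max(a_max_prev, a)
    scale = math.exp(a_max_prev - a_max)  # rescales the old sums, <= 1
    w = math.exp(a - a_max)               # weight of the new step, <= 1
    n = n_prev * scale + z * w
    d = d_prev * scale + w
    return n, d, a_max

def stable_average(zs, logits):
    """Weighted average of zs with weights exp(logit), computed stably."""
    n, d, a_max = 0.0, 0.0, float("-inf")
    for z, a in zip(zs, logits):
        n, d, a_max = stable_update(n, d, a_max, z, a)
    return n / d
```

For example, `stable_average([1.0, -1.0, 0.5], [1000.0, 999.0, 1001.0])` evaluates without error, whereas a direct `math.exp(1000.0)` raises `OverflowError` in Python.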

4. Empirical Performance on Benchmark Tasks

The RWA model has been benchmarked against LSTM across a suite of synthetic and real-world sequence tasks, with both models kept at identical hidden size (250 units), initialization, and optimization settings (Ostmeyer et al., 2017):

| Task | Metric | RWA | LSTM |
|---|---|---|---|
| Artificial Grammar | Steps to 100% accuracy | ~600 | ~1,000 |
| Sequence Length | Steps to near-perfect accuracy | ~100 | ~2,000 |
| Variable Copy (T=100) | Steps to beat baseline | ~1,000 | ~10,000 |
| Variable Copy (T=1000) | Steps to beat baseline | ~3,000 | barely after ~50,000 |
| Adding Problem (T=100) | Steps to beat baseline MSE | ~1,000 | ~3,000 |
| Adding Problem (T=1000) | Steps to beat baseline MSE | ~1,000 | ~15,000 |
| MNIST Sequential | Test accuracy after 250k steps | 98.1% (unpermuted), 93.5% (permuted) | 99.0% (unpermuted), 93.6% (permuted) |

Across these experiments, RWA converges faster, performs strongly on long-range dependency tasks, and matches or exceeds LSTM in final accuracy or error in most settings where a single output aggregates the whole sequence. It uses approximately 25% fewer parameters per hidden unit than LSTM, at the same $O(1)$ per-step cost (Ostmeyer et al., 2017).

On multi-copy or character-level modeling tasks, which require the ability to forget or revise the influence of earlier inputs, RWA shows marked limitations: it fails to learn the multi-copy task and performs poorly on the Wikipedia character-prediction task, with a worse bits-per-character (bpc) score than LSTM or GRU (Maginnis et al., 2017).

5. Relationship to Attention Mechanisms

Classical attention computes a weighted sum over a non-recurrent collection of encoder outputs and must revisit the entire sequence for every decoding step, which is computationally infeasible for long or streaming data. RWA replaces this paradigm with a recurrent, running normalization in which attention weights are integrated into the state update and the relevance of past steps is cumulative, not retroactively adjustable.

Specifically, RWA's log-attention $a_t$ is computed on the fly from the current input and previous hidden state, then exponentiated and used to reweight the signed encoding $z_t$ in the global average. While this allows direct, fast feedback from every previous step, it also means that once an input's weight is accumulated, it cannot be discounted or revised. This contrasts sharply with classical attention, which is flexible but incurs high cost (Ostmeyer et al., 2017, Maginnis et al., 2017).

6. Architecture, Initialization, and Implementation

  • Input handling: input vectors $x_t$; supports one-hot or real-valued encodings.
  • Activations: $f = \tanh$ for nonlinearity; gating with $\tanh(g_t)$; signed encoding via element-wise product.
  • Initialization: […], […], other weights drawn from a uniform distribution over […].
  • Optimization: Adam ([…]), batch size 100; gradient clipping was not required in the original work but is suggested if instability is encountered.
  • Output: Fully connected layer on the final hidden state $h_T$ or on each $h_t$, with cross-entropy for classification and MSE for regression (Ostmeyer et al., 2017).

7. Limitations and Subsequent Developments

By construction, RWA cannot forget or revise the weights assigned to earlier timesteps; the running denominator $d_t$ is strictly increasing, and altering the influence of early points would require exponentially larger weights for later points. This limitation prevents RWA from effectively handling tasks involving multiple, consecutive sub-tasks or those where local, recent context is essential (e.g., repeated copy, next-symbol prediction in language modeling).
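A small calculation illustrates how large a later weight must be to push earlier steps aside. The sketch below considers one hidden unit; the function name `logit_to_discount_past` and the parameter `eps` (the share of total weight left to all earlier steps) are hypothetical names introduced for this example. It follows from requiring $d_{t-1} / (d_{t-1} + e^{a_t}) \leq \varepsilon$.

```python
import math

def logit_to_discount_past(d_prev, eps):
    """Smallest logit a_t such that all earlier steps together keep at most
    a fraction `eps` of the total attention weight (one hidden unit).

    Solving d_prev / (d_prev + exp(a_t)) <= eps for a_t gives
    a_t >= log(d_prev) + log((1 - eps) / eps).
    """
    return math.log(d_prev) + math.log((1.0 - eps) / eps)

# After 1000 steps with logit 0 the denominator is 1000. Reducing the
# combined past to a 1% share requires a single weight exp(a_t) that is
# 99 times the accumulated denominator.
a_needed = logit_to_discount_past(1000.0, 0.01)
past_share = 1000.0 / (1000.0 + math.exp(a_needed))
```

Because the denominator only grows, the required weight grows with it, so each later attempt to override the past is more expensive than the one before.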

To address these drawbacks, the Recurrent Discounted Attention (RDA) unit extends RWA by introducing a discount gate, permitting the model to actively reduce the impact of previous states and support complex, multi-output or forgetting-required scenarios with higher efficiency and accuracy (Maginnis et al., 2017).

Further open directions for RWA include bidirectional architectures, stacked hierarchies, autoencoding, and applications in real-world domains such as NLP, genomics, and generative music, though systematic evaluation on large-scale, naturally occurring datasets remains to be undertaken (Ostmeyer et al., 2017).
