Create a Video View Paper

Pretraining Recurrent Networks without Recurrence

This presentation explores Supervised Memory Training (SMT), a groundbreaking non-recurrent pretraining method for RNNs that sidesteps the fundamental limitations of Backpropagation Through Time. By decoupling memory representation from memory dynamics and using Transformer-derived predictive states as supervision targets, SMT achieves time-parallel training with constant-length gradient paths, enabling robust learning of long-range dependencies while maintaining the computational efficiency of fixed-memory architectures.

Script

Training recurrent networks has always meant unrolling them backward through time, watching gradients vanish or explode as they traverse hundreds of steps. The authors of this paper ask a radical question: what if you could pretrain an RNN without any recurrence at all?

Supervised Memory Training decouples what to remember from how to update memory. A Transformer encoder produces predictive states that capture all necessary context, and the RNN learns only the one-step dynamics, reducing the gradient path from order T to order 1.

This architectural shift has profound consequences for gradient stability. While BPTT gradients vanish exponentially with sequence length and remain sensitive to initialization, SMT gradients stay uniform across all timesteps, eliminating the recency bias that cripples long-range credit assignment.

Across synthetic benchmarks probing gradient flow, memory utilization, and compositional reasoning, SMT followed by DAgger Memory Training dominates BPTT in every dimension. On realistic tasks like character-level language modeling and pixel sequence prediction, SMT-pretrained RNNs generate globally coherent outputs where BPTT fails entirely.

The method exhibits classic scaling laws: longer context and larger memory yield predictable gains, and the performance gap between student RNN and teacher Transformer narrows with model capacity. Critically, additional compute enables further memory compression, linking efficiency with bounded hardware requirements.

Supervised Memory Training opens a path to fixed-memory architectures that learn like Transformers but run with the efficiency of recurrent models. To explore how these methods might reshape sequence modeling or to create your own video summaries of emerging research, visit EmergentMind.com.