Anticipated Reweighted Truncated BPTT
- The paper introduces ARTBP, achieving unbiased gradient estimates for RNNs by randomizing truncation points and applying compensation factors during backpropagation.
- It maintains the computational efficiency and memory advantages of truncated BPTT while reliably capturing long-term dependencies in sequential data.
- Empirical results on synthetic tasks and language modeling demonstrate ARTBP’s improved convergence and stability over traditional truncated BPTT.
Anticipated Reweighted Truncated Backpropagation (ARTBP) is a stochastic gradient estimation algorithm for training recurrent neural networks (RNNs) on long sequences. It preserves the computational and memory efficiency of truncated Backpropagation Through Time (BPTT) while providing unbiased gradient estimates, thereby enabling reliable learning of long-term dependencies. ARTBP achieves unbiasedness by randomizing truncation points and applying compensation factors within the backward recursion. The method was introduced by Tallec and Ollivier to address the convergence issues associated with the biased gradients of truncated BPTT (Tallec & Ollivier, 2017).
1. Background: BPTT and Truncated BPTT
Standard Backpropagation Through Time (BPTT) computes gradients for RNNs by unrolling the entire recurrent system and backpropagating gradients through every timestep. For a system defined by $s_{t+1} = F(s_t, x_{t+1}, \theta)$ and stepwise loss $\ell_t = \ell(s_t, y_t)$, the total sequence loss is $\mathcal{L} = \sum_{t=1}^{T} \ell_t$. The exact gradient $\partial \mathcal{L} / \partial \theta$ requires storing all recurrent states and backpropagating through all $T$ unrolled layers, resulting in $O(T)$ space and $O(T)$ compute requirements.
Truncated BPTT alleviates this cost by dividing the full sequence into consecutive fixed-length blocks of length $L$. Gradients are computed only within each block, and the recurrence graph is cut at block boundaries: no gradient signals propagate across those boundaries. The backward recursion for the approximate signals simply sets the recurrent term to zero every $L$ steps. While this provides $O(L)$ memory and $O(L)$ update time, the resulting estimator is biased: dependencies across block boundaries are omitted, resulting in unreliable learning of long-term dependencies and possible divergence during training.
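In the notation above, the exact backward recursion that full BPTT computes is

$$
\delta_t = \frac{\partial \ell_t}{\partial s_t} + \delta_{t+1}\,\frac{\partial F}{\partial s}(s_t, x_{t+1}, \theta), \qquad \delta_T = \frac{\partial \ell_T}{\partial s_T}, \qquad \frac{\partial \mathcal{L}}{\partial \theta} = \sum_{t=1}^{T} \delta_t\,\frac{\partial F}{\partial \theta}(s_{t-1}, x_t, \theta),
$$

and truncated BPTT replaces the carried term $\delta_{t+1}\,\partial F/\partial s$ with zero at every block boundary.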
2. ARTBP Formalism
ARTBP removes the bias in truncated BPTT by introducing randomness in the truncation points and adjusting the backpropagation equations with compensation factors.
Notation and Recursion
- Let $X_{t+1} \in \{0, 1\}$ indicate whether a truncation occurs between timesteps $t$ and $t+1$.
- $c_{t+1} = P(X_{t+1} = 1)$ is the (possibly time-varying) truncation probability.
The backward recursion for the adjusted signals $\tilde{\delta}_t$ is

$$
\tilde{\delta}_T = \frac{\partial \ell_T}{\partial s_T}, \qquad \tilde{\delta}_t = \frac{\partial \ell_t}{\partial s_t} + \frac{1 - X_{t+1}}{1 - c_{t+1}}\,\tilde{\delta}_{t+1}\,\frac{\partial F}{\partial s}(s_t, x_{t+1}, \theta),
$$

so the continuation of the gradient is dropped when $X_{t+1} = 1$ and rescaled by $1/(1 - c_{t+1})$ otherwise. The unbiased gradient estimator is then

$$
\tilde{g} = \sum_{t=1}^{T} \tilde{\delta}_t\,\frac{\partial F}{\partial \theta}(s_{t-1}, x_t, \theta).
$$
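A minimal NumPy sketch of this recursion, assuming the per-step loss gradients and Jacobians have already been computed (the function name, argument layout, and shapes are illustrative, not from the paper):

```python
import numpy as np

def artbp_backward(dl_ds, dF_ds, dF_dtheta, X, c):
    """Compensated ARTBP backward pass.

    dl_ds:     length-T list; dl_ds[t] is the gradient of the step-t loss w.r.t. s_t (shape [d])
    dF_ds:     length-(T-1) list; dF_ds[t] is the Jacobian of s_{t+1} w.r.t. s_t (shape [d, d])
    dF_dtheta: length-T list; dF_dtheta[t] is the local Jacobian of s_t w.r.t. theta (shape [d, p])
    X:         length-(T-1) list; X[t] = 1 if a truncation occurs between t and t+1
    c:         length-(T-1) list; c[t] is the probability of that truncation
    Returns the unbiased gradient estimate (shape [p]).
    """
    T = len(dl_ds)
    g = np.zeros(dF_dtheta[0].shape[1])
    delta = dl_ds[T - 1]                               # compensated signal at the final step
    g += delta @ dF_dtheta[T - 1]
    for t in range(T - 2, -1, -1):                     # walk backward in time
        # drop the continuation (X=1) or rescale it by 1/(1-c) (X=0)
        carry = 0.0 if X[t] else (delta @ dF_ds[t]) / (1.0 - c[t])
        delta = dl_ds[t] + carry
        g += delta @ dF_dtheta[t]
    return g
```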
Truncation Distribution
- Geometric: A constant $c_t = c$ yields geometrically (approximately exponentially) distributed block lengths with mean $1/c$.
- Heavy-tailed: For variance control, $c_{\delta t}$ is made to depend on $\delta t$, the time since the last truncation, decaying roughly like $\alpha / \delta t$ for large $\delta t$ with $\alpha > 1$, and scaled so that the mean block length matches a target $L$. Block lengths then have polynomially decaying tails, and the compensation factors $\prod 1/(1 - c_t)$ grow only polynomially in the block length rather than exponentially; a sketch of both schedules follows below.
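A sketch of the two schedules and an on-the-fly length sampler, assuming a simple functional form for the heavy-tailed hazard (the formula below is an illustrative choice, not necessarily the one used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def c_geometric(dt, L):
    """Constant truncation probability: geometric lengths with mean roughly L."""
    return 1.0 / L

def c_heavy_tailed(dt, L, alpha=2.0):
    """Illustrative heavy-tailed schedule (assumed form): the hazard decays like
    alpha/dt for large dt, giving polynomial tails and a mean length near L."""
    return min(1.0, alpha / (dt + (alpha - 1.0) * L))

def sample_segment_length(c_fn, L, max_len=100_000):
    """Draw per-step Bernoulli truncation decisions until one fires."""
    dt = 1
    while dt < max_len and rng.random() >= c_fn(dt, L):
        dt += 1
    return dt

for c_fn in (c_geometric, c_heavy_tailed):
    lengths = [sample_segment_length(c_fn, L=16) for _ in range(10_000)]
    print(c_fn.__name__, "empirical mean length:", np.mean(lengths))
```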
3. ARTBP Algorithm and Computational Properties
The typical ARTBP workflow includes:
- Sampling subsequence lengths on the fly via per-step Bernoulli draws with probability $c_t$.
- Forward pass computing the states $s_t$ and stepwise losses $\ell_t$ over the chosen segment.
- Backward pass computing the compensated signals $\tilde{\delta}_t$, with the $1/(1 - c_t)$ factors applied as above.
- Gradient accumulation: $\tilde{g} \leftarrow \tilde{g} + \tilde{\delta}_t\,\partial F/\partial \theta(s_{t-1}, x_t, \theta)$ for each $t$ in the segment.
- Parameter update: $\theta \leftarrow \theta - \eta\,\tilde{g}$ (a full sketch of this loop follows below).
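A compact end-to-end sketch of this workflow for a small tanh RNN trained online with plain SGD (the model, data, loss, learning rate, and truncation schedule below are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: s_{t+1} = tanh(W s_t + U x_{t+1} + b), stepwise loss 0.5 * ||s_t - y_t||^2.
d, dx, T = 8, 4, 5000
W = 0.1 * rng.standard_normal((d, d))
U = 0.1 * rng.standard_normal((d, dx))
b = np.zeros(d)
xs = rng.standard_normal((T, dx))
ys = rng.standard_normal((T, d))
lr, L = 0.01, 16

def c_fn(dt):
    """Truncation probability given the time since the last cut (assumed heavy-tailed form)."""
    return min(1.0, 2.0 / (dt + L))

s = np.zeros(d)
seg_states, seg_x, seg_y, seg_c = [s], [], [], []     # buffers for the current segment
total_loss = 0.0
for t in range(T):
    s = np.tanh(W @ seg_states[-1] + U @ xs[t] + b)   # forward step
    total_loss += 0.5 * np.sum((s - ys[t]) ** 2)
    seg_states.append(s)
    seg_x.append(xs[t])
    seg_y.append(ys[t])
    seg_c.append(c_fn(len(seg_x)))                    # probability used at this boundary
    if rng.random() < seg_c[-1] or t == T - 1:        # Bernoulli truncation draw
        # Compensated backward pass over the buffered segment.
        gW, gU, gb = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
        delta = np.zeros(d)                           # the carry beyond the cut is dropped
        for k in range(len(seg_x) - 1, -1, -1):
            sk, sk_prev = seg_states[k + 1], seg_states[k]
            delta = (sk - seg_y[k]) + delta           # local loss gradient plus carry
            dpre = delta * (1.0 - sk ** 2)            # backprop through tanh
            gW += np.outer(dpre, sk_prev)
            gU += np.outer(dpre, seg_x[k])
            gb += dpre
            if k > 0:                                 # rescale the carry across an uncut boundary
                delta = (dpre @ W) / (1.0 - seg_c[k - 1])
        W -= lr * gW                                  # SGD parameter update
        U -= lr * gU
        b -= lr * gb
        seg_states, seg_x, seg_y, seg_c = [s], [], [], []   # start a new segment from the current state

print("mean loss per step:", total_loss / T)
```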
| Method | Space per Update | Time per Update | Unbiased? |
|---|---|---|---|
| BPTT (full) | $O(T)$ | $O(T)$ | Yes |
| Truncated BPTT | $O(L)$ | $O(L)$ | No |
| ARTBP | $O(\ell)$ (mean $L$) | $O(\ell)$ (mean $L$) | Yes |

Here $\ell$ denotes the random subsequence length, with mean $L$. The overhead incurred by sampling truncation points and rescaling by $1/(1 - c_t)$ is negligible.
4. Theoretical Unbiasedness
Tallec & Ollivier prove that

$$
\mathbb{E}_{X}\big[\tilde{g}\big] = \frac{\partial \mathcal{L}}{\partial \theta},
$$

where $\mathbb{E}_{X}$ denotes the expectation over the stochastic truncation schedule. The proof uses backward induction, showing that the expected backward signal satisfies $\mathbb{E}_{X}[\tilde{\delta}_t] = \partial \mathcal{L} / \partial s_t$, the exact BPTT signal. At each time $t$, the continuation of the gradient is either dropped with probability $c_{t+1}$ or rescaled by $1/(1 - c_{t+1})$ with probability $1 - c_{t+1}$, so that $\mathbb{E}\big[(1 - X_{t+1})/(1 - c_{t+1})\big] = 1$ and each step contributes exactly one continuation in expectation, matching the BPTT recursion. No structural assumptions are required beyond $c_t < 1$ wherever continuation is possible, together with the Markov property of the truncation process.
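This identity can be checked numerically. The sketch below uses an illustrative scalar linear RNN with a constant truncation probability (all names, sizes, and values are assumptions for the demonstration): it averages the ARTBP estimate over many sampled truncation schedules and compares the result with the exact BPTT gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny scalar RNN: s_{t+1} = a * s_t + x_{t+1}, loss sum_t 0.5 * (s_t - y_t)^2.
T, a, c = 20, 0.9, 0.2                                  # constant (geometric) truncation probability
xs, ys = rng.standard_normal(T), rng.standard_normal(T)

def forward(a):
    s, states = 0.0, []
    for x in xs:
        s = a * s + x
        states.append(s)
    return np.array(states)

def full_bptt_grad(a):
    """Exact gradient of the total loss w.r.t. a via full BPTT."""
    states = forward(a)
    g, delta = 0.0, 0.0
    for t in range(T - 1, -1, -1):
        delta = (states[t] - ys[t]) + a * delta         # exact backward recursion
        g += delta * (states[t - 1] if t > 0 else 0.0)  # local derivative of s_t w.r.t. a
    return g

def artbp_grad(a):
    """One ARTBP estimate with randomly sampled truncations and 1/(1-c) compensation."""
    states = forward(a)
    X = rng.random(T - 1) < c                           # truncation indicators between steps
    g, delta = 0.0, 0.0
    for t in range(T - 1, -1, -1):
        delta = (states[t] - ys[t]) + a * delta
        g += delta * (states[t - 1] if t > 0 else 0.0)
        if t > 0:
            delta = 0.0 if X[t - 1] else delta / (1.0 - c)   # drop or compensate the carry
    return g

exact = full_bptt_grad(a)
estimates = [artbp_grad(a) for _ in range(100_000)]
print("exact BPTT gradient:    ", exact)
print("mean of ARTBP estimates:", np.mean(estimates))   # agrees up to Monte Carlo error
```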
5. Empirical Evaluations
Influence-Balancing Synthetic Task
- Setup: A linear chain of positive and negative agents whose signals reach the loss only after a delay.
- Methods: Truncated BPTT with several fixed truncation lengths; ARTBP with a matched mean truncation length and its compensated stochastic schedule.
- Findings:
- Truncated BPTT diverges for the smaller truncation lengths due to bias; only the largest converges, and only slowly.
- ARTBP converges reliably across all random seeds.
- Unbiasedness is necessary to balance multi-scale temporal dependencies.
Penn Treebank Character-Level Language Modeling
- Model: Single-layer LSTM, batch size 64, Adam optimizer.
- Schedules: A fixed truncation length for truncated BPTT; ARTBP uses the heavy-tailed truncation schedule described above with a matched mean length.
- Results:
- Truncated BPTT test bpc: 1.43.
- ARTBP test bpc: 1.40.
- ARTBP provides a small but observable improvement in validation and test metrics.
- Varying the truncation-length parameter ($4$ vs $6$) had minor impact; smaller values decrease memory usage but increase gradient variance.
6. Practical Considerations and Limitations
- Truncation distribution: For a known memory budget of roughly $L$ timesteps, set the mean truncation length to $L$. A constant $c_t$ (geometric schedule) is simplest but amplifies gradient variance; a heavy-tailed schedule, with $c_{\delta t}$ depending on the time since the last truncation, keeps the variance of the compensation factors manageable.
- Variance vs. memory trade-off: Lower $c_t$ (longer segments) reduces variance at the cost of memory; higher $c_t$ (shorter blocks) increases stochasticity and may require gradient-norm monitoring or clipping.
- Operational modes: ARTBP supports both online streaming (stepwise) and mini-batch operation. Batch samples should not cross truncation boundaries to preserve the schedule.
- Limitations: ARTBP introduces gradient noise from the stochastic schedule, which may slow convergence in deterministic settings relative to full BPTT. Compensation factors can become large if $c_t$ approaches unity, potentially destabilizing updates. Monitoring gradient norms (see the sketch after this list) and increasing the mean truncation length are recommended if instability or high variance is observed.
- Recommended procedure: Initialize the mean truncation length to match the available memory budget and start with a heavy-tailed schedule; adjust based on observed gradient variance and available memory.
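As a minimal illustration of the monitoring advice above (a sketch with arbitrary thresholds, assuming the ARTBP gradient estimate is available as a NumPy array):

```python
import numpy as np
from collections import deque

recent_norms = deque(maxlen=100)            # rolling window of recent gradient norms

def clip_and_monitor(grad, max_norm=5.0):
    """Clip an ARTBP gradient estimate and flag sustained high variance.

    The threshold and window size are arbitrary placeholders, not values from the paper."""
    norm = float(np.linalg.norm(grad))
    recent_norms.append(norm)
    if norm > max_norm:                     # rescale overly large compensated gradients
        grad = grad * (max_norm / norm)
    if len(recent_norms) == recent_norms.maxlen and np.std(recent_norms) > np.mean(recent_norms):
        print("high gradient variance: consider a longer mean truncation length")
    return grad
```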
7. Significance and Applications
ARTBP solves the longstanding issue of gradient bias in truncated BPTT for RNN training on long sequences. The unbiasedness of ARTBP’s gradient estimates makes it suitable for tasks where accurate credit assignment across long temporal horizons is essential. Its memory and computational complexity are comparable to those of traditional truncated BPTT, with the added practical consideration of managing gradient variance via hyperparameter choices. ARTBP provides a principled solution without imposing architectural changes on the underlying recurrent model or loss functions (Tallec & Ollivier, 2017).