
Anticipated Reweighted Truncated BPTT

Updated 4 February 2026
  • The paper introduces ARTBP, achieving unbiased gradient estimates for RNNs by randomizing truncation points and applying compensation factors during backpropagation.
  • It maintains the computational efficiency and memory advantages of truncated BPTT while reliably capturing long-term dependencies in sequential data.
  • Empirical results on synthetic tasks and language modeling demonstrate ARTBP’s improved convergence and stability over traditional truncated BPTT.

Anticipated Reweighted Truncated Backpropagation (ARTBP) is a stochastic gradient estimation algorithm for training recurrent neural networks (RNNs) on long sequences. It preserves the computational and memory efficiency of truncated Backpropagation Through Time (BPTT) while providing unbiased gradient estimates, thereby enabling reliable learning of long-term dependencies. ARTBP achieves unbiasedness by randomizing truncation points and applying compensation factors within the backward recursion. The method was introduced by Tallec and Ollivier to address the convergence issues associated with the biased gradients of truncated BPTT (Tallec & Ollivier, 2017).

1. Background: BPTT and Truncated BPTT

Standard Backpropagation Through Time (BPTT) computes gradients for RNNs by unrolling the entire recurrent system and backpropagating gradients through every timestep. For a system defined by $s_{t+1} = F(x_{t+1}, s_t, \theta)$ with stepwise loss $\ell_t = \ell(s_t, o_t)$, the total sequence loss is $L_T = \sum_{t=1}^T \ell_t$. The exact gradient requires storing all recurrent states $\{s_t\}$ and backpropagating through $T$ layers, resulting in $O(T)$ space and compute requirements.

Truncated BPTT alleviates this cost by dividing the full sequence into consecutive fixed-length blocks of length $L$. Gradients are computed only within each block, and the recurrence graph is cut at block boundaries: no gradient signal propagates across them. The backward recursion for the approximate signals $\hat{\delta\ell}_t$ sets the recurrent term to zero every $L$ steps. This yields $O(L)$ memory and update time, but the estimator is biased: dependencies across block boundaries are omitted, which makes learning of long-term dependencies unreliable and can cause divergence during training.
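To make the source of the bias concrete, here is a minimal NumPy sketch of the fixed-truncation backward recursion; the Jacobian arrays `dl_ds` and `dF_ds` and the cut-every-`L`-steps convention are illustrative assumptions, not code from the paper.

```python
import numpy as np

def truncated_bptt_backward(dl_ds, dF_ds, L):
    """Biased truncated-BPTT backward recursion (illustrative sketch).

    dl_ds:  (T, d) array, dl/ds(s_t, o_t) at each step t.
    dF_ds:  (T-1, d, d) array, Jacobian dF/ds(x_{t+1}, s_t, theta).
    L:      fixed block length; gradient flow is cut every L steps.
    Returns hat_delta, an approximation of dL_T/ds_t per step.
    """
    T, d = dl_ds.shape
    hat_delta = np.zeros((T, d))
    hat_delta[T - 1] = dl_ds[T - 1]
    for t in range(T - 2, -1, -1):
        if (t + 1) % L == 0:
            # Block boundary: the recurrent term is zeroed -- this
            # discarded cross-boundary contribution is the source of bias.
            hat_delta[t] = dl_ds[t]
        else:
            hat_delta[t] = hat_delta[t + 1] @ dF_ds[t] + dl_ds[t]
    return hat_delta
```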

2. ARTBP Formalism

ARTBP removes the bias in truncated BPTT by introducing randomness in the truncation points and adjusting the backpropagation equations with compensation factors.

Notation and Recursion

  • Let $X_t \in \{0, 1\}$ indicate whether a truncation occurs between $t$ and $t+1$.
  • $c_t = P(X_t = 1 \mid X_1, \dots, X_{t-1})$ is the (possibly time-varying) truncation probability.

The backward recursion for the adjusted signals $\tilde{\delta\ell}_t$ is:

$$\tilde{\delta\ell}_t = \begin{cases} \dfrac{\partial\ell}{\partial s}(s_t, o_t), & \text{if } X_t = 1 \text{ or } t = T \\[6pt] \dfrac{1}{1 - c_t}\left[\tilde{\delta\ell}_{t+1} \cdot \dfrac{\partial F}{\partial s}(x_{t+1}, s_t, \theta)\right] + \dfrac{\partial\ell}{\partial s}(s_t, o_t), & \text{otherwise} \end{cases}$$

The unbiased gradient estimator is then

$$\tilde g = \sum_{t=1}^{T} \tilde{\delta\ell}_t \cdot \frac{\partial F}{\partial \theta}(x_t, s_{t-1}, \theta)$$
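The recursion and estimator above translate directly into code. The following hedged NumPy sketch assumes precomputed Jacobian arrays (`dl_ds`, `dF_ds`, `dF_dtheta`) and a sampled cut pattern `X` with probabilities `c`; these names and shapes are illustrative choices, not the paper's pseudocode.

```python
import numpy as np

def artbp_backward(dl_ds, dF_ds, X, c):
    """ARTBP backward recursion with 1/(1 - c_t) compensation (sketch).

    dl_ds:  (T, d) array, dl/ds(s_t, o_t).
    dF_ds:  (T-1, d, d) array, Jacobian dF/ds(x_{t+1}, s_t, theta).
    X, c:   (T-1,) arrays; X[t] = 1 if a cut was sampled between t and t+1,
            c[t] = P(X_t = 1 | past).
    Returns the compensated signals tilde_delta, shape (T, d).
    """
    T, d = dl_ds.shape
    tilde_delta = np.zeros((T, d))
    tilde_delta[T - 1] = dl_ds[T - 1]
    for t in range(T - 2, -1, -1):
        if X[t] == 1:
            tilde_delta[t] = dl_ds[t]      # truncate: drop the recurrent term
        else:                              # keep it, reweighted by 1/(1 - c_t)
            tilde_delta[t] = (tilde_delta[t + 1] @ dF_ds[t]) / (1.0 - c[t]) + dl_ds[t]
    return tilde_delta

def artbp_gradient(tilde_delta, dF_dtheta):
    """Unbiased estimator: sum_t tilde_delta[t] . dF/dtheta(x_t, s_{t-1}, theta)."""
    # dF_dtheta: (T, d, p) array of per-step parameter Jacobians.
    return np.einsum('td,tdp->p', tilde_delta, dF_dtheta)
```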

Truncation Distribution

  • Geometric: constant $c_t = c$, yielding geometrically distributed segment lengths.
  • Heavy-tailed: for variance control,

$$c_t = \frac{\alpha - 1}{(\alpha - 2) L_0 + \Delta t}$$

where $\Delta t$ is the time since the last truncation, $L_0$ is the target mean length, and $\alpha > 3$.
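Both schedules are straightforward to sample online. A small sketch, assuming the convention that $\Delta t = 1$ on the first step after a cut (the function names are hypothetical):

```python
import numpy as np

def heavy_tailed_c(dt, L0, alpha):
    """c_t = (alpha - 1) / ((alpha - 2) * L0 + dt), with dt the number of
    steps since the last truncation (dt = 1 right after a cut)."""
    return (alpha - 1.0) / ((alpha - 2.0) * L0 + dt)

def sample_truncations(T, L0=16, alpha=6.0, seed=0):
    """Sample the cut indicators X_1..X_{T-1} online via Bernoulli draws."""
    rng = np.random.default_rng(seed)
    X = np.zeros(T - 1, dtype=int)
    c = np.zeros(T - 1)
    dt = 0
    for t in range(T - 1):
        dt += 1
        c[t] = heavy_tailed_c(dt, L0, alpha)
        X[t] = rng.random() < c[t]
        if X[t]:
            dt = 0  # a segment just ended; restart the clock
    return X, c
```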

3. ARTBP Algorithm and Computational Properties

The typical ARTBP workflow includes:

  1. Sampling subsequence lengths on the fly via Bernoulli draws with probability $c_t$.
  2. Forward pass computing $s_t = F(x_t, s_{t-1}, \theta)$ over the chosen segment.
  3. Backward pass computing $\tilde{\delta\ell}_t$ with compensation, as above.
  4. Gradient accumulation: $G \leftarrow G + \tilde{\delta\ell}_t \cdot \frac{\partial F}{\partial \theta}(x_t, s_{t-1}, \theta)$.
  5. Parameter update: $\theta \leftarrow \theta - \eta G$.
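The five steps combine into a short training loop. The sketch below runs them end to end on a hypothetical scalar RNN $s_t = \theta s_{t-1} + x_t$ with loss $\ell_t = \tfrac{1}{2} s_t^2$, chosen so that every Jacobian is a scalar ($\partial F/\partial s = \theta$, $\partial F/\partial \theta = s_{t-1}$, $\partial\ell/\partial s = s_t$); the model and hyperparameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scalar RNN: s_t = theta * s_{t-1} + x_t, loss l_t = 0.5 * s_t**2.
T, L0, alpha, eta = 200, 16, 6.0, 1e-4
theta = 0.5
x = rng.normal(size=T + 1)

for step in range(100):
    # Steps 1-2: forward pass, sampling truncation points on the fly.
    s = np.zeros(T + 1)
    X = np.zeros(T + 1, dtype=int)
    c = np.zeros(T + 1)
    dt = 0
    for t in range(1, T + 1):
        s[t] = theta * s[t - 1] + x[t]
        if t < T:
            dt += 1
            c[t] = (alpha - 1.0) / ((alpha - 2.0) * L0 + dt)
            X[t] = rng.random() < c[t]
            if X[t]:
                dt = 0
    # Steps 3-4: compensated backward pass and gradient accumulation.
    # (Resetting tilde at each cut makes this equivalent to processing
    # the sampled segments independently.)
    G, tilde = 0.0, 0.0
    for t in range(T, 0, -1):
        if t == T or X[t] == 1:
            tilde = s[t]                        # dl/ds at t
        else:
            tilde = tilde * theta / (1.0 - c[t]) + s[t]
        G += tilde * s[t - 1]                   # tilde * dF/dtheta
    # Step 5: parameter update.
    theta -= eta * G
```

The complexity comparison below summarizes what this buys relative to full and fixed-length truncated BPTT.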
| Method | Space per Update | Time per Update | Unbiased? |
|---|---|---|---|
| BPTT (full) | $O(T)$ | $O(T)$ | Yes |
| Truncated BPTT | $O(L)$ | $O(L)$ | No |
| ARTBP | $O(L')$ (mean $L_0$) | $O(L')$ (mean $L_0$) | Yes |

$L'$ denotes a random subsequence length with mean $L_0$. The overhead incurred by sampling $X_t$ and compensating by $1/(1-c_t)$ is negligible.

4. Theoretical Unbiasedness

Tallec & Ollivier prove that

$$E_X[\tilde g] = \frac{\partial L_T}{\partial \theta}$$

where $E_X$ denotes the expectation over the stochastic truncation schedule. The proof uses backward induction, showing that the expected backward signal matches the exact BPTT signal $\delta\ell_t$:

$$E\left[\tilde{\delta\ell}_t \mid X_1, \dots, X_{t-1}\right] = \delta\ell_t$$

At each time $t$, the continuation of the gradient is either dropped with probability $c_t$ or rescaled by $1/(1-c_t)$ with probability $1-c_t$, which ensures a single expected contribution per step, matching the BPTT recursion. No structural assumptions are required beyond $c_t \in [0,1)$ and the Markov property of the truncation process.
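The unbiasedness identity can be spot-checked numerically. The sketch below compares the exact full-BPTT gradient with a Monte Carlo average of ARTBP estimates on the same hypothetical scalar RNN used earlier, under a constant (geometric) truncation probability; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar RNN s_t = theta * s_{t-1} + x_t with l_t = 0.5 * s_t**2.
T, theta, c = 20, 0.7, 0.2          # constant (geometric) truncation probability
x = rng.normal(size=T + 1)
s = np.zeros(T + 1)
for t in range(1, T + 1):
    s[t] = theta * s[t - 1] + x[t]

def backward(X, comp):
    """Backward pass; X[t] cuts between t and t+1, comp is the c used in
    the 1/(1 - c) compensation (comp = 0 recovers exact BPTT)."""
    tilde, g = 0.0, 0.0
    for t in range(T, 0, -1):
        if t == T or X[t]:
            tilde = s[t]
        else:
            tilde = tilde * theta / (1.0 - comp) + s[t]
        g += tilde * s[t - 1]
    return g

exact = backward(np.zeros(T + 1, dtype=bool), 0.0)      # full BPTT, no cuts
draws = [backward(rng.random(T + 1) < c, c) for _ in range(50_000)]
print(exact, np.mean(draws))  # the two should agree up to Monte Carlo error
```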

5. Empirical Evaluations

Influence-Balancing Synthetic Task

  • Setup: a linear chain of $p = 10$ positive and $n = 13$ negative agents, with signals arriving after a delay.
  • Methods: truncated BPTT with $L \in \{10, 100, 200\}$; ARTBP with $L_0 = 16$, $\alpha = 6$.
  • Findings:
    • Truncated BPTT diverges for $L = 10$ and $L = 100$ due to bias; only $L = 200$ converges, and slowly.
    • ARTBP converges reliably across all random seeds at a rate $O(t^{-1/2})$.
    • Unbiasedness is necessary to balance multi-scale temporal dependencies.

Penn Treebank Character-Level Language Modeling

  • Model: Single-layer LSTM, batch size 64, Adam optimizer.
  • Schedules: fixed $L = 50$ for truncated BPTT; ARTBP uses $c_t$ with $\alpha = 4$, $L_0 = 50$.
  • Results:
    • Truncated BPTT test bits per character (bpc): $\approx 1.43$.
    • ARTBP test bpc: $\approx 1.40$.
    • ARTBP provides a small but observable improvement in validation and test metrics.
    • Varying $\alpha$ (4 vs. 6) had minor impact; a smaller $L_0$ decreases memory usage but increases gradient variance.

6. Practical Considerations and Limitations

  • Truncation distribution: for a known memory budget $L$, set $L_0 \approx L$. A constant $c_t = 1/L$ (geometric) is simplest but amplifies gradient variance; a heavy-tailed $c_t \propto 1/(\Delta t + \text{const})$ with $\alpha > 3$ keeps the variance of the compensation factors under control.
  • Variance vs. memory trade-off: lower $c_t$ (longer segments) reduces variance at the cost of memory; higher $c_t$ (shorter blocks) increases stochasticity and may require gradient-norm monitoring or clipping.
  • Operational modes: ARTBP supports both online streaming (stepwise) and mini-batch operation. Batch samples should not cross truncation boundaries, so that the $c_t$ schedule is preserved.
  • Limitations: ARTBP introduces gradient noise from the stochastic schedule, which may slow convergence in deterministic settings relative to full BPTT. Compensation factors can become large if $c_t$ approaches unity, potentially destabilizing updates. Monitoring gradient norms and increasing $L_0$ or $\alpha$ is recommended if instability or high variance is observed.
  • Recommended procedure: start with $\alpha \approx 4$ and $L_0$ matched to the memory budget; adjust both based on observed gradient variance and available memory.
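One consequence of these recommendations can be checked before training: under the heavy-tailed schedule, $c_t$ peaks immediately after a truncation ($\Delta t = 1$), so the worst-case per-step reweighting $1/(1-c_t)$ is known in advance. A small illustrative helper (hypothetical, not from the paper):

```python
def max_compensation(L0, alpha):
    """Peak per-step reweighting 1/(1 - c_t) under the heavy-tailed
    schedule; c_t is largest right after a cut (Delta t = 1)."""
    c_max = (alpha - 1.0) / ((alpha - 2.0) * L0 + 1.0)
    return 1.0 / (1.0 - c_max)

for L0, alpha in [(50, 4.0), (50, 6.0), (16, 6.0), (2, 3.1)]:
    print(L0, alpha, round(max_compensation(L0, alpha), 3))
# Large L0 keeps the factor near 1; tiny L0 with alpha near 3 lets it blow up.
```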

7. Significance and Applications

ARTBP solves the longstanding issue of gradient bias in truncated BPTT for RNN training on long sequences. The unbiasedness of ARTBP’s gradient estimates makes it suitable for tasks where accurate credit assignment across long temporal horizons is essential. Its memory and computational complexity are comparable to traditional truncated BPTT, with the added practical consideration of managing gradient variance via hyperparameter choices. ARTBP provides a principled solution without imposing architectural changes on the underlying recurrent model or loss functions (Tallec & Ollivier, 2017).

References

Tallec, C., & Ollivier, Y. (2017). Unbiasing Truncated Backpropagation Through Time. arXiv:1705.08209.
