
Bi-RWKV Layers: Bidirectional RWKV Extensions

Updated 23 June 2026
  • Bi-RWKV Layers are architectural enhancements to the RWKV model that integrate bidirectional context via Bi-WKV operators and triplet-block pseudo-biaccess mechanisms.
  • The Bi-WKV operator fuses forward and backward recurrences using a dynamic gating mechanism, yielding improved accuracy and efficient global context aggregation.
  • The triplet-block layout achieves pseudo-bidirectional diffusion through a specialized masking strategy, maintaining linear-time complexity while enhancing conditioning for language and audio tasks.

Bi-RWKV layers are architectural extensions of the RWKV model family that integrate bidirectional or pseudo-bidirectional context aggregation into the original Receptance Weighted Key Value (RWKV) paradigm. These modifications appear in two distinct forms: (1) Bi-WKV blocks, proposed for sequence modeling in audio ("AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition" (Xiong et al., 2 Sep 2025)), and (2) the triplet-block layout for pseudo-bidirectional discrete diffusion ("Triplet-Block Diffusion RWKV" (Lin et al., 25 May 2026)). Both enable global or blockwise conditioning in a linear-time, stable, and memory-efficient manner.

1. Mathematical Foundations of RWKV and Bidirectionalization

RWKV models construct O(L)-time sequence processors via key-value recurrences, parameterized by dynamic decay factors, keys, and values computed from per-token projections, plus a receptance gating vector. In the original formulation, the sequence is scanned left-to-right only, enforcing causal dependency.

Bidirectional RWKV introduces two variants:

  • Bi-WKV Operator: Two full WKV recurrences, one forward ($t = 1 \dots L$) and one backward ($t = L \dots 1$), are computed. The outputs are fused per position using a dynamic gate $G_t$ derived from local features. This yields a layer output $p^{\text{fused}}_t = G_t \circ p^{\rightarrow}_t + (1 - G_t) \circ p^{\leftarrow}_t$, where $\circ$ denotes elementwise multiplication and $G_t \in [0,1]^d$ (Xiong et al., 2 Sep 2025).
  • Triplet-Block Pseudo-Biaccess ("Biaccess"): The input sequence is chunked into logical blocks. For each, three physical blocks are created: (1) masked, (2) masked (predictive; loss applied), (3) clean (hidden state reset). The causal model processes these left-to-right, but because unmasked tokens from block 1 are already absorbed into the hidden state before block 2 is encountered (where loss is computed), the effective context for each predicted token within block 2 is bidirectional relative to the logical block, despite preservation of strict causality at the model level (Lin et al., 25 May 2026).

2. Bi-WKV Layers: Formulation and Forward/Backward Recurrence

The Bi-WKV operator is formalized as follows (Xiong et al., 2 Sep 2025):

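  • For each time step $t$ and LayerNorm-normalized input $x_t$,
    • Maintain [state definition not recoverable from the source]
    • [equation not recoverable from the source]
    • [equation not recoverable from the source]
    • [equation not recoverable from the source]
    • 3. Backward pass ($t = L \dots 1$) with identical recurrences, yielding $p^{\leftarrow}_t$.
    • 4. Fuse with $G_t$ from a depthwise separable convolutional "convshift" local residual: $p^{\text{fused}}_t = G_t \circ p^{\rightarrow}_t + (1 - G_t) \circ p^{\leftarrow}_t$, where $G_t \in [0,1]^d$.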

This mechanism maintains O(L·d) time complexity and two d-dimensional recurrent states per scan, making the bidirectional extension computationally efficient and stable. The use of double-exponential decay and S_V/S_K normalization ensures numerical stability at long time scales.
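The forward scan, backward scan, and gated fusion can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AudioRWKV implementation: the recurrence is a simplified normalized weighted average with a fixed per-channel decay (in the spirit of the S_V/S_K normalization mentioned above), the gate logits are passed in directly instead of being computed by a convshift module, and the names `wkv_scan` and `bi_wkv_fused` are hypothetical.

```python
import math


def wkv_scan(keys, values, decay):
    """One directional scan with two d-dimensional states (s_v, s_k).

    Simplified stand-in for a WKV recurrence (assumption, not RWKV7):
        s_v <- decay * s_v + exp(k_t) * v_t
        s_k <- decay * s_k + exp(k_t)
        p_t  = s_v / s_k
    """
    d = len(decay)
    s_v = [0.0] * d
    s_k = [0.0] * d
    outputs = []
    for k_t, v_t in zip(keys, values):
        weights = [math.exp(k) for k in k_t]
        s_v = [decay[c] * s_v[c] + weights[c] * v_t[c] for c in range(d)]
        s_k = [decay[c] * s_k[c] + weights[c] for c in range(d)]
        # s_k is strictly positive, so the division is always defined.
        outputs.append([s_v[c] / s_k[c] for c in range(d)])
    return outputs


def bi_wkv_fused(keys, values, decay, gate_logits):
    """Fuse forward and backward scans with a per-position sigmoid gate."""
    forward = wkv_scan(keys, values, decay)
    # Backward scan: run the same recurrence on the reversed sequence,
    # then flip the outputs back into the original order.
    backward = wkv_scan(keys[::-1], values[::-1], decay)[::-1]
    fused = []
    for p_fwd, p_bwd, logits in zip(forward, backward, gate_logits):
        gate = [1.0 / (1.0 + math.exp(-g)) for g in logits]
        fused.append(
            [g * a + (1.0 - g) * b for g, a, b in zip(gate, p_fwd, p_bwd)]
        )
    return fused
```

Because each directional output in this sketch is a positively weighted average of the values seen so far, and the gate lies in [0, 1], the fused output stays within the range of the input values. The cost is two linear passes over the sequence, which matches the O(L·d) behavior described above. The sketch does not guard against overflow of `exp` for large keys.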

3. Triplet-Block Layout for Pseudo-Bidirectional Diffusion

The triplet-block mechanism (editor's term: pseudo-biaccess) was introduced to unify linear-time causal backbones with the parallel, bidirectional context required for discrete diffusion modeling (Lin et al., 25 May 2026). Given a sequence of length $L$:

  • Divide the input into $L/B$ logical blocks of length $B$.
  • For each logical block, create three physical blocks:
    • Block 1: Masked (no loss)
    • Block 2: Same masked copy (loss computed where not masked)
    • Block 3: Clean (resets the hidden state)
  • Apply a randomly sampled mask pattern per block, including full-mask and forced EOS/PAD positions.
  • Because the model operates strictly left-to-right, by the start of block 2 the hidden state has already integrated the unmasked tokens from block 1, providing the lossable block with access to both left and right unmasked context within the logical block.
  • At inference, a block-level iterative token commitment process initializes all $B$ positions as masked and commits tokens by a confidence threshold, in a MaskGIT-inspired regime.

There are no modifications to the RWKV cell, hidden size, or weights; only the sequencing of physical input adapts, resulting in bidirectional conditioning at the block level while maintaining O(L) computational complexity.
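Since the change is purely at the data level, the layout can be sketched as a sequence-building function. This is a minimal illustration, not the authors' code: the function name `build_triplet_layout`, the `<mask>` placeholder string, and the role labels are hypothetical, the mask flags are supplied by the caller instead of being sampled, and the hidden-state reset associated with the clean block is not modeled.

```python
MASK = "<mask>"  # hypothetical placeholder for the mask token


def build_triplet_layout(tokens, block_len, mask_flags):
    """Expand each logical block into three physical blocks.

    Returns (physical_tokens, roles), where roles marks each physical
    position as "context" (first masked copy, no loss), "predict"
    (second masked copy, where the loss is applied), or "clean".
    """
    if len(tokens) != len(mask_flags):
        raise ValueError("tokens and mask_flags must have the same length")
    if len(tokens) % block_len != 0:
        raise ValueError("sequence length must be a multiple of block_len")

    physical, roles = [], []
    for start in range(0, len(tokens), block_len):
        clean = tokens[start:start + block_len]
        flags = mask_flags[start:start + block_len]
        masked = [MASK if m else tok for tok, m in zip(clean, flags)]
        # Same masked copy twice, then the clean block.
        physical += masked + masked + clean
        roles += (
            ["context"] * block_len
            + ["predict"] * block_len
            + ["clean"] * block_len
        )
    return physical, roles
```

For an input of length L the physical sequence has 3L positions, so a causal model still processes it in time linear in L, only with a constant factor of three during training.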

4. Comparison of Approaches: Bi-WKV Versus Triplet-Block Biaccess

| Aspect | Bi-WKV (AudioRWKV) | Triplet-Block (B³D-RWKV) |
| --- | --- | --- |
| Directionality | True bidirectional (full scan both directions) | Pseudo-bidirectional within blocks |
| Computational Cost (per L) | O(2L·d), still linear in L | O(L), data-level overhead (3× at training) |
| Model Parameters | Adds fusion gate, convshift residual | No change (all weights untouched) |
| Context Aggregation | Global, per-layer | Block-local, per training arrangement |
| Primary Application | Audio sequence modeling | Discrete diffusion language modeling |

Bi-WKV provides explicit bidirectional context by running separate forward and backward passes and fusing results, while the triplet-block layout arranges inputs such that the causal model’s hidden state delivers blockwise bidirectionality implicitly.

5. Empirical Performance and Complexity

The Bi-WKV approach, including fusion via a learnable sigmoid gate and local convshift residual, yields the following results on the AudioSet-2M mAP task (Xiong et al., 2 Sep 2025):

  • Causal RWKV7 baseline: 34.5 mAP
  • Adding bidirectional scan (average fusion): 38.4 mAP
  • Weighted gate fusion: 39.0 mAP
  • Full ConvShift + Bi-WKV: 40.9 mAP
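To read the ablation as a sequence of increments, the step-to-step gains can be computed directly from the reported numbers. The sketch below assumes the rows are cumulative in the order listed, which the list suggests but does not state; the helper name `incremental_gains` is hypothetical.

```python
# mAP values as reported in the list above (AudioSet-2M).
ABLATION = [
    ("Causal RWKV7 baseline", 34.5),
    ("Bidirectional scan (average fusion)", 38.4),
    ("Weighted gate fusion", 39.0),
    ("Full ConvShift + Bi-WKV", 40.9),
]


def incremental_gains(rows):
    """Return (name, gain over the previous row) for every row but the first."""
    return [
        (name, round(score - prev_score, 1))
        for (_, prev_score), (name, score) in zip(rows, rows[1:])
    ]


def total_gain(rows):
    """Gain of the last row over the first row."""
    return round(rows[-1][1] - rows[0][1], 1)
```

Under that reading, the bidirectional scan contributes the largest step (+3.9 mAP), the learned gate adds +0.6, and ConvShift adds a further +1.9, for a total of +6.4 mAP over the causal baseline.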

AudioRWKV’s RWKV7 operator with Bi-WKV in spatial mixing provides up to a 13.3× speedup over FlashAttention on a 4090 GPU with long (5 min) audio, with throughput remaining flat as sequence length increases.

For B³D-RWKV (Lin et al., 25 May 2026), throughput experiments on H100 with a 32-layer d=4096 RWKV-7-g1f-7.2B backbone in 8-GPU training demonstrate:

  • B³D-RWKV-7.2B achieves decoding throughput at 1.6× that of a standard causal RWKV-7-7.2B at comparable quality, reaching up to 2.0× with fewer inference steps or lower confidence thresholds.
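The link between the confidence threshold and the number of inference steps can be illustrated with a sketch of threshold-based token commitment for one block. This is a hedged illustration in the MaskGIT style, not the B³D-RWKV decoder: `iterative_commit` and `toy_predict` are hypothetical names, the predictor is a fixed toy function standing in for the model, and the rule that commits the single most confident position when nothing clears the threshold is an assumption added here to guarantee progress.

```python
MASK_TOKEN = "<mask>"  # hypothetical placeholder for the mask token

# Fixed toy confidences, one per position (illustrative values only).
TOY_CONFIDENCE = [0.9, 0.5, 0.7, 0.2]


def toy_predict(block):
    """Stand-in for the model: a (token, confidence) pair per position."""
    return [(f"tok{i}", TOY_CONFIDENCE[i]) for i in range(len(block))]


def iterative_commit(predict, block_len, threshold):
    """Start fully masked; commit positions whose confidence >= threshold.

    Returns the decoded block and the number of predictor calls used.
    """
    block = [MASK_TOKEN] * block_len
    steps = 0
    while MASK_TOKEN in block:
        proposals = predict(block)
        open_positions = [i for i, tok in enumerate(block) if tok == MASK_TOKEN]
        chosen = [i for i in open_positions if proposals[i][1] >= threshold]
        if not chosen:
            # Assumed fallback: commit the most confident open position
            # so that every step makes progress.
            chosen = [max(open_positions, key=lambda i: proposals[i][1])]
        for i in chosen:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps
```

With the toy confidences, a threshold of 0.6 needs three predictor calls to fill the block, while a threshold of 0.1 fills it in one. Fewer calls per block is where the extra decoding throughput comes from, at the risk of committing low-confidence tokens. A real predictor would also update its confidences as tokens are committed, which this toy does not model.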

Both methods improve on their causal RWKV counterparts: Bi-WKV in accuracy (34.5 to 40.9 mAP on AudioSet-2M), and the triplet-block layout in decoding throughput (1.6× to 2.0×).

6. Implementation Notes and Stability

For Bi-WKV (Xiong et al., 2 Sep 2025):

  • All bidirectional extensions follow numerically stable RWKV7 recurrences, leveraging double-exponential decay for boundedness.
  • The fusion gate $G_t$ is obtained from a convolutional residual, so no new unstable operators are introduced.
  • Training and inference memory requirements remain linear in sequence length, with minor constants due to dual scans and gating.

For triplet-block B³D-RWKV (Lin et al., 25 May 2026):

  • The approach is parameter-free at the architectural level. No new layers, parameters, or hidden state representations are involved.
  • All weight matrices and state transitions remain as in the vanilla RWKV backbone.
  • The only modification occurs at data layout: token sequencing and masking regime.
  • During training, $3L$ physical tokens per sequence yield a 3× overhead, still linear in $L$; at inference, the amortized cost per block is reduced by token commitment shortcuts.

7. Applications and Broader Impact

Bi-RWKV layers extend the applicability of the RWKV architecture to tasks requiring global or block-local bidirectional context. Bi-WKV is primarily used for audio pattern recognition, where full-sequence context is critical and efficient long-range modeling is required. The triplet-block system enables linear-time, parallelizable, discrete diffusion for language modeling, closing the gap between diffusion-based bidirectional inference and causally trained backbone models, while preserving the efficiency of RWKV.

These extensions support high-throughput, stable modeling in both language and audio domains, using bidirectional context without incurring quadratic attention cost or destabilizing the recurrence. The Bi-WKV and triplet-block pseudo-biaccess approaches are thus two complementary routes to efficient bidirectional sequence modeling (Xiong et al., 2 Sep 2025, Lin et al., 25 May 2026).
