Bi-RWKV Layers: Bidirectional RWKV Extensions

Updated 23 June 2026

Bi-RWKV Layers are architectural enhancements to the RWKV model that integrate bidirectional context via Bi-WKV operators and triplet-block pseudo-biaccess mechanisms.
The Bi-WKV operator fuses forward and backward recurrences using a dynamic gating mechanism, yielding improved accuracy and efficient global context aggregation.
The triplet-block layout achieves pseudo-bidirectional diffusion through a specialized masking strategy, maintaining linear-time complexity while enhancing conditioning for language and audio tasks.

Bi-RWKV layers are architectural extensions of the RWKV model family that integrate bidirectional or pseudo-bidirectional context aggregation into the original Receptance Weighted Key Value (RWKV) paradigm. These modifications appear predominantly in two orthogonal forms: (1) Bi-WKV blocks, proposed for audio sequence modeling in "AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition" (Xiong et al., 2 Sep 2025), and (2) the triplet-block layout for pseudo-bidirectional discrete diffusion in "Triplet-Block Diffusion RWKV" (Lin et al., 25 May 2026). Both enable global or blockwise conditioning in a linear-time, stable, and memory-efficient manner.

1. Mathematical Foundations of RWKV and Bidirectionalization

RWKV models process a sequence of length L in O(L) time via key-value recurrences, parameterized by dynamic decay factors, keys, and values computed from per-token projections, plus a receptance gating vector. In the original formulation, the sequence is scanned left-to-right only, enforcing causal dependency.

Bidirectional RWKV introduces two variants:

Bi-WKV Operator: Two full WKV recurrences are computed, one forward ($t = 1 \dots L$) and one backward ($t = L \dots 1$). The outputs are fused per position using a dynamic gate $G_t$ derived from local features. This yields a layer output $p^{\text{fused}}_t = G_t \circ p^{\rightarrow}_t + (1 - G_t) \circ p^{\leftarrow}_t$, where $\circ$ denotes elementwise multiplication and $G_t \in [0,1]^d$ (Xiong et al., 2 Sep 2025).
Triplet-Block Pseudo-Biaccess ("Biaccess"): The input sequence is chunked into logical blocks. For each, three physical blocks are created: (1) masked, (2) masked (predictive; loss applied), (3) clean (hidden state reset). The causal model processes these left-to-right. Because the unmasked tokens of block 1 are already absorbed into the hidden state before block 2 (where the loss is computed) is encountered, the effective context for each predicted token within block 2 is bidirectional relative to the logical block, even though the model itself remains strictly causal (Lin et al., 25 May 2026).

2. Bi-WKV Layers: Formulation and Forward/Backward Recurrence

The Bi-WKV operator is formalized as follows (Xiong et al., 2 Sep 2025):

For each time step $t$ and LayerNorm-normalized input $x_t$:
- 1. Maintain the recurrent scan state (the per-step state and update expressions are not reproduced here).
- 2. Forward pass ($t = 1 \dots L$), yielding $p^{\rightarrow}_t$.
- 3. Backward pass ($t = L \dots 1$) with identical recurrences, yielding $p^{\leftarrow}_t$.
- 4. Fuse with $G_t$ from a depthwise separable convolutional "convshift" local residual: $p^{\text{fused}}_t = G_t \circ p^{\rightarrow}_t + (1 - G_t) \circ p^{\leftarrow}_t$, where $G_t \in [0,1]^d$.

This mechanism maintains O(L·d) time complexity and two d-dimensional recurrent states per scan, making the bidirectional extension computationally efficient and stable. The use of double-exponential decay and S_V/S_K normalization ensures numerical stability at long time scales.
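The forward scan, backward scan, and gated fusion can be sketched for a single channel in plain Python. This is a simplified illustration, not the RWKV7 operator: the recurrence is a decay-weighted running average, the names `wkv_scan`, `bi_wkv`, `s_v`, and `s_k` are hypothetical, the decay is a fixed constant, and the gate logits are passed in directly rather than computed by a convshift residual.

```python
import math


def wkv_scan(keys, values, decay):
    """Directional scan over one channel: a decay-weighted running average.

    keys, values: lists of floats of equal length; decay: float in (0, 1).
    Simplified stand-in for a WKV recurrence (assumption, not RWKV7).
    """
    s_v = 0.0  # running weighted sum of values
    s_k = 0.0  # running sum of weights (the normalizer)
    out = []
    for k, v in zip(keys, values):
        w = math.exp(k)  # positive weight derived from the key
        s_v = decay * s_v + w * v
        s_k = decay * s_k + w
        out.append(s_v / s_k)
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bi_wkv(keys, values, decay, gate_logits):
    """Fuse a forward and a backward scan with a per-position gate."""
    p_fwd = wkv_scan(keys, values, decay)
    # Backward scan: run the same recurrence on the reversed sequence,
    # then reverse the result so index t lines up with the forward scan.
    p_bwd = wkv_scan(keys[::-1], values[::-1], decay)[::-1]
    fused = []
    for pf, pb, z in zip(p_fwd, p_bwd, gate_logits):
        g = sigmoid(z)  # gate value in (0, 1)
        fused.append(g * pf + (1.0 - g) * pb)
    return fused
```

Each scan keeps two scalars per channel, so a d-channel layer would keep two d-dimensional states per direction, and the cost grows linearly with sequence length. A gate logit far above zero recovers the forward scan alone, and one far below zero recovers the backward scan.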

3. Triplet-Block Layout for Pseudo-Bidirectional Diffusion

The triplet-block mechanism (here termed pseudo-biaccess) was introduced to unify linear-time causal backbones with the parallel, bidirectional context required for discrete diffusion modeling (Lin et al., 25 May 2026). Given a sequence of length $L$:

Divide the input into $L/B$ logical blocks of length $B$.
For each logical block, create three physical blocks:
- Block 1: Masked (no loss)
- Block 2: Same masked copy (loss computed where not masked)
- Block 3: Clean (resets the hidden state)
Apply a randomly sampled mask pattern per block, including full-mask and forced EOS/PAD positions.
Because the model operates strictly left-to-right, by the start of block 2 the hidden state has already integrated the unmasked tokens from block 1, giving the loss-bearing block access to both left and right unmasked context within the logical block.
At inference, a block-level iterative token commitment process initializes all $B$ positions as masked and commits tokens whose confidence meets a threshold, in a MaskGIT-inspired regime.

There are no modifications to the RWKV cell, hidden size, or weights; only the sequencing of physical input adapts, resulting in bidirectional conditioning at the block level while maintaining O(L) computational complexity.
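As a rough sketch of the data layout only, the function below expands a token list into the masked, masked, clean ordering and flags the positions of the second physical block. The names `triplet_layout` and `MASK`, the independent per-token masking probability, and the string mask token are assumptions made for illustration. The full-mask patterns and forced EOS/PAD positions mentioned above are omitted, and the choice of which flagged positions actually contribute to the loss is left to the training code.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def triplet_layout(tokens, block_len, mask_prob, rng):
    """Expand a sequence into (masked, masked, clean) physical blocks.

    Returns (physical_tokens, loss_block_flags). A flag is True for every
    position that belongs to the second (loss-bearing) physical block.
    """
    physical = []
    loss_block_flags = []
    for start in range(0, len(tokens), block_len):
        clean = tokens[start:start + block_len]
        # One mask pattern per logical block, shared by blocks 1 and 2.
        masked = [MASK if rng.random() < mask_prob else tok for tok in clean]
        n = len(clean)
        physical += masked + masked + clean
        loss_block_flags += [False] * n + [True] * n + [False] * n
    return physical, loss_block_flags


if __name__ == "__main__":
    toks = list("abcdefgh")
    phys, flags = triplet_layout(toks, 4, 0.5, random.Random(0))
    print(len(toks), "logical tokens ->", len(phys), "physical tokens")
```

The output is three times the input length, which is the training-time overhead of the layout; the model that consumes it is unchanged.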

4. Comparison of Approaches: Bi-WKV Versus Triplet-Block Biaccess

Aspect	Bi-WKV (AudioRWKV)	Triplet-Block (B³D-RWKV)
Directionality	True bidirectional (full scan both directions)	Pseudo-bidirectional within blocks
Computational Cost (per L)	O(2L·d), still linear in L	O(L), data-level overhead (3x at training)
Model Parameters	Adds fusion gate, convshift residual	No change (all weights untouched)
Context Aggregation	Global, per-layer	Block-local, per training arrangement
Primary Application	Audio sequence modeling	Discrete diffusion language modeling

Bi-WKV provides explicit bidirectional context by running separate forward and backward passes and fusing results, while the triplet-block layout arranges inputs such that the causal model’s hidden state delivers blockwise bidirectionality implicitly.

5. Empirical Performance and Complexity

The Bi-WKV approach, including fusion via a learnable sigmoid gate and local convshift residual, yields the following mAP results on AudioSet-2M (Xiong et al., 2 Sep 2025):

Causal RWKV7 baseline: 34.5 mAP
Adding bidirectional scan (average fusion): 38.4 mAP
Weighted gate fusion: 39.0 mAP
Full ConvShift + Bi-WKV: 40.9 mAP

AudioRWKV’s RWKV7 operator with Bi-WKV in spatial mixing provides up to a 13.3× speedup over FlashAttention on a 4090 GPU for long, 5-minute-scale audio, with throughput remaining flat as sequence length increases.

For B³D-RWKV (Lin et al., 25 May 2026), throughput experiments on H100 with a 32-layer d=4096 RWKV-7-g1f-7.2B backbone in 8-GPU training demonstrate:

B³D-RWKV-7.2B achieves a decoding throughput of roughly 1.6× that of a standard causal RWKV-7-7.2B at comparable quality, reaching up to 2.0× with fewer inference steps or lower confidence thresholds.
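The link between the confidence threshold and the number of inference steps can be illustrated with a small commitment loop. This is a hedged sketch of threshold-based commitment in general, not the paper's decoder: `commit_block`, `toy_predict`, the `max_steps` cap, and the fallback that commits the single most confident position when nothing clears the threshold are all assumptions.

```python
def commit_block(predict, block_len, threshold, max_steps=16):
    """Iteratively fill one block, committing confident positions each step.

    predict(block) must return a (token, confidence) pair for every position.
    None marks a position that is still masked. If max_steps is reached,
    the returned block may still contain None entries.
    """
    block = [None] * block_len
    steps = 0
    while None in block and steps < max_steps:
        proposals = predict(block)
        open_pos = [i for i, tok in enumerate(block) if tok is None]
        chosen = [i for i in open_pos if proposals[i][1] >= threshold]
        if not chosen:
            # Assumed fallback: guarantee progress by committing the
            # single most confident open position.
            chosen = [max(open_pos, key=lambda i: proposals[i][1])]
        for i in chosen:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps


def toy_predict(block):
    # Hypothetical stand-in for the model: fixed confidences per position.
    confidences = [0.9, 0.4, 0.7, 0.2]
    return [(f"t{i}", c) for i, c in enumerate(confidences)]


if __name__ == "__main__":
    for thr in (0.95, 0.6, 0.1):
        print(thr, commit_block(toy_predict, 4, thr))
```

With this toy predictor, lowering the threshold lets more positions commit per step, so the block finishes in fewer model calls. A real model's confidences change between steps, which the fixed table here does not capture.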

Both methods improve on their causal RWKV counterparts: Bi-WKV in accuracy (34.5 to 40.9 mAP on AudioSet-2M) and B³D-RWKV in decoding throughput (1.6× to 2.0×).

6. Implementation Notes and Stability

For Bi-WKV (Xiong et al., 2 Sep 2025):

All bidirectional extensions follow numerically stable RWKV7 recurrences, using double-exponential decay for boundedness.
The fusion gate $G_t$ is obtained from a convolutional residual and introduces no new unstable operators.
Training and inference memory requirements remain linear in sequence length, with minor constants due to dual scans and gating.

For triplet-block B³D-RWKV (Lin et al., 25 May 2026):

The approach is parameter-free at the architectural level. No new layers, parameters, or hidden state representations are involved.
All weight matrices and state transitions remain as in the vanilla RWKV backbone.
The only modification occurs at data layout: token sequencing and masking regime.
During training, the $3L$ physical tokens per sequence yield a 3× overhead, still linear in $L$; at inference, token commitment shortcuts reduce the amortized cost per block.

7. Applications and Broader Impact

Bi-RWKV layers extend the applicability of the RWKV architecture to tasks requiring global or block-local bidirectional context. Bi-WKV is primarily used for audio pattern recognition, where full-sequence context is critical and efficient long-range modeling is required. The triplet-block system enables linear-time, parallelizable, discrete diffusion for language modeling, closing the gap between diffusion-based bidirectional inference and causally trained backbone models, while preserving the efficiency of RWKV.

These extensions enable high-throughput, stable modeling for both language and audio domains, using bidirectional context without incurring quadratic attention cost or destabilizing the recurrence. The Bi-WKV and triplet-block pseudo-biaccess approaches are thus two complementary routes to efficient bidirectional sequence modeling with RWKV (Xiong et al., 2 Sep 2025; Lin et al., 25 May 2026).


References

Xiong et al. (2 Sep 2025). AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition.

Lin et al. (25 May 2026). Triplet-Block Diffusion RWKV.
