Diffusion-RWKV: Efficient Generative Models

Updated 23 June 2026

Diffusion-RWKV is a family of generative models that combines RWKV’s RNN-inspired, linear-complexity design with continuous and discrete diffusion processes for efficient, high-resolution synthesis.
They replace traditional Transformer/CNN U-Nets with RWKV-like layers, achieving global context through time-mix recurrence and enabling parallel denoising with linear scaling.
Recent variants incorporate specialized modules such as CrossWKV, Bi-RWKV, and triplet-block layouts to enhance cross-modal alignment, text-to-image synthesis, and discrete token generation.

Diffusion-RWKV denotes a family of generative models that integrate the linear-complexity, RNN-inspired RWKV architecture with both continuous and discrete diffusion processes for image and language generation. Originating from the time-mix (WKV) mechanism of RWKV [Peng et al., 2023], these models substitute Transformer-based or CNN-based U-Nets in diffusion pipelines with RWKV-like layers, achieving global context with O(L·D) or O(T·D) scaling while supporting parallel denoising, efficient high-resolution synthesis, and strong cross-modal alignment. Recent variants extend the approach to multimodal tasks and discrete token generation by introducing specialized mechanisms such as the CrossWKV module (Xiao et al., 19 Apr 2025), Bi-RWKV layers (Fei et al., 2024), and the triplet-block data layout (Lin et al., 25 May 2026).

1. Foundations: RWKV Block and Its Adaptation to Diffusion

The cornerstone of Diffusion-RWKV is the Weighted Key-Value (WKV) time-mix recurrence, which computes a soft, exponentially weighted aggregation of keys and values with strictly O(L·D) complexity per layer and obviates the quadratic bottlenecks of self-attention. Each RWKV block comprises two submodules:

Time-Mix/WKV: Implements a sequence of decayed aggregations via rolling exponential smoothing, capturing global dependencies through a recurrence over per-token projections. For an input token $x_t \in \mathbb{R}^D$, projections are linearly mixed between $x_t$ and $x_{t-1}$ and then recurrently aggregated to form hidden states.
Channel-Mix: Applies a further linear mix between tokens at consecutive timesteps, followed by a gated MLP over channels.
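The time-mix recurrence can be illustrated with a minimal pure-Python sketch for a single channel, following the RWKV-4-style WKV formula. The decay `w`, current-token bonus `u`, mixing weight `mu`, and all function names are illustrative assumptions; production kernels process all $D$ channels at once and add numerical-stability tricks that are omitted here.

```python
import math

def token_shift(xs, mu):
    """Linearly mix each token with its predecessor: mu*x_t + (1-mu)*x_{t-1}."""
    prev = 0.0  # the first token has no predecessor; assume zero padding
    out = []
    for x in xs:
        out.append(mu * x + (1.0 - mu) * prev)
        prev = x
    return out

def wkv_recurrent(ks, vs, w, u):
    """O(L) form: two running accumulators carry the whole history."""
    num, den = 0.0, 0.0  # decayed sums of e^k * v and e^k over past tokens
    decay = math.exp(-w)
    out = []
    for k, v in zip(ks, vs):
        bonus = math.exp(u + k)  # extra weight on the current token
        out.append((num + bonus * v) / (den + bonus))
        num = decay * num + math.exp(k) * v
        den = decay * den + math.exp(k)
    return out

def wkv_naive(ks, vs, w, u):
    """O(L^2) reference: explicit weighted average over all past tokens."""
    out = []
    for t in range(len(ks)):
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out

ks = [0.1, -0.3, 0.5, 0.2]
vs = [1.0, 2.0, -1.0, 0.5]
fast = wkv_recurrent(ks, vs, w=0.5, u=0.3)
slow = wkv_naive(ks, vs, w=0.5, u=0.3)
assert all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

The recurrent and naive forms agree, but the recurrent one touches each token once, which is where the linear per-layer cost comes from.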

For diffusion, images are patchified and embedded into sequences, with RWKV-style layers replacing the U-Net or Transformer backbone (Fei et al., 2024).
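As a rough illustration of the patchification step, the sketch below splits a nested-list image into non-overlapping patches and flattens each patch into one token. The function name and the toy image are assumptions made for this example; the learned linear embedding and positional information that follow in a real model are omitted.

```python
def patchify(image, p):
    """Split an H x W x C nested-list image into non-overlapping p x p patches.

    Returns (H // p) * (W // p) rows, each of length p * p * C.
    """
    H, W = len(image), len(image[0])
    if H % p or W % p:
        raise ValueError("H and W must be divisible by the patch size p")
    patches = []
    for top in range(0, H, p):
        for left in range(0, W, p):
            flat = []
            for r in range(top, top + p):
                for c in range(left, left + p):
                    flat.extend(image[r][c])  # append the C channel values
            patches.append(flat)
    return patches

# Toy 4 x 4 image with C = 1 whose pixel value encodes its raster index.
image = [[[4 * r + c] for c in range(4)] for r in range(4)]
tokens = patchify(image, p=2)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of length 4
```

Patches are emitted in raster order, so the resulting sequence is the spatially flattened view that the recurrent layers scan.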

2. Diffusion-RWKV in Continuous Generative Modeling

A canonical Diffusion-RWKV model adopts the variance-preserving DDPM framework, using stacks of (bi-)directional RWKV blocks:

Image Patchification and Embedding: The input image $I \in \mathbb{R}^{H \times W \times C}$ is split into non-overlapping patches, producing a sequence $X \in \mathbb{R}^{J \times (p^2C)}$ which is embedded and combined with positional information.
Bidirectional and Quad-Directional Recurrence (Bi-RWKV): Each layer applies a quad-directional shift, bidirectional recurrence along spatially-flattened sequences, and skip connections.
Conditioning: Class labels and timesteps are injected via in-context tokens, adaptive LayerNorm (adaLN), or adaLN-Zero schemes.
Diffusion Process: Standard forward (noising) and reverse (denoising) processes are applied, with the model parameterizing the noise prediction $\epsilon_\theta(x_t, t, c)$ for DDPM's $\ell_2$ loss.
Linear Complexity: Each WKV step updates O(D) accumulators, yielding total cost O(L·J·D) for L layers. Unlike Transformer attention's O(J²D), the architecture avoids windowing or locality constraints, enabling direct high-resolution generation (Fei et al., 2024).
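The diffusion-process item above can be sketched as a single training step of the noise-prediction objective. This is a minimal sketch under stated assumptions: the linear beta schedule (1e-4 to 0.02 over 1000 steps) is the common DDPM default rather than a setting confirmed by the cited papers, and the denoiser is replaced by an oracle so that the example stays self-contained.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, alpha_bar, eps):
    """Variance-preserving forward step: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def eps_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -1.0, 0.25, 0.0]              # a flattened toy "image"
t = rng.randrange(len(alpha_bars))       # random timestep
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # target noise
x_t = q_sample(x0, alpha_bars[t], eps)

# Placeholder for the RWKV denoiser eps_theta(x_t, t, c): an oracle that
# inverts q_sample, so the loss is close to zero by construction.
a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
eps_pred = [(xt - a * x) / b for xt, x in zip(x_t, x0)]
loss = eps_loss(eps_pred, eps)
```

In an actual model, `eps_pred` would come from the stack of RWKV blocks conditioned on the timestep and class label, and `loss` would be backpropagated through that stack.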

3. DIR-7 and the CrossWKV Module

The DIR-7 (Diffusion in RWKV-7) framework integrates the RWKV-7 backbone with a CrossWKV module for efficient text-to-image synthesis (Xiao et al., 19 Apr 2025). The main components are:

CrossWKV Cross-Attention: Simultaneously fuses CLIP-encoded text ($q$) and image tokens ($x$) in a single unidirectional pass. The transition matrix $T_t$ at each step is non-diagonal and input-dependent, realized via a generalized delta-rule recurrence with vector-valued gating.
Low-Rank Adaptations (LoRA): Decay, rate, and gate parameters are refined via LoRA modules of varying ranks (e.g., 64 for decay/rate, 128 for gate, 16 for value blending), enabling parameter-efficient adaptation.
Normalization and Output: Deep layers employ GroupNorm and value blending for stability. The WKV kernel operates either in chunked or fused fashion.
State-Tracking and Expressivity: RWKV-7’s transition matrices support computations beyond $\mathsf{TC}^0$, allowing modeling of arbitrary regular languages, demonstrated via $S_5$ permutation tasks. This results in dynamic memory and long-range coherence for scene and board-game generation.
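To make the idea of a non-diagonal, input-dependent transition concrete, here is a minimal sketch of the classic delta rule on a matrix-valued state. It is a simplification and not the CrossWKV module itself: the vector-valued decay, gating, LoRA refinements, and text conditioning described above are left out, and the function names and scalar `beta` are assumptions for this example.

```python
def delta_rule_step(S, k, v, beta=1.0):
    """One delta-rule update: S <- S (I - beta k k^T) + beta v k^T.

    S is a list of rows with shape len(v) x len(k); k is assumed to have
    unit norm. The implied transition matrix (I - beta k k^T) depends on
    the input k and is not diagonal.
    """
    new_S = []
    for row, v_i in zip(S, v):
        read = sum(s * k_j for s, k_j in zip(row, k))  # value currently stored at k
        new_S.append([s + beta * (v_i - read) * k_j for s, k_j in zip(row, k)])
    return new_S

def read_state(S, q):
    """Query the state with q: returns S q."""
    return [sum(s * q_j for s, q_j in zip(row, q)) for row in S]

S = [[0.0, 0.0], [0.0, 0.0]]
k1, k2 = [1.0, 0.0], [0.0, 1.0]         # orthonormal keys
S = delta_rule_step(S, k1, [3.0, -1.0])
S = delta_rule_step(S, k2, [0.5, 2.0])
S = delta_rule_step(S, k1, [7.0, 7.0])  # overwrites the value stored at k1

print(read_state(S, k1))  # [7.0, 7.0]
print(read_state(S, k2))  # [0.5, 2.0]
```

The third update replaces what was stored under `k1` without disturbing `k2`, which is the kind of targeted memory edit that a purely diagonal decay cannot express.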

DIR-7 achieves strong empirical performance: FID of 2.88 and CLIP score of 0.33 on ImageNet 256×256, matching DiT-XL/2, but with substantially lower memory (4.5 GB vs. 6.5 GB) and 0.52 s inference time (vs. 0.85 s) (Xiao et al., 19 Apr 2025).
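The reported memory and latency figures translate into relative savings as follows; the helper name is an assumption for this example, and the inputs are exactly the numbers quoted above.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

memory_saving = relative_reduction(6.5, 4.5)     # GB, DiT-XL/2 vs. DIR-7
latency_saving = relative_reduction(0.85, 0.52)  # seconds per inference

print(f"memory: {memory_saving:.1%} lower")    # memory: 30.8% lower
print(f"latency: {latency_saving:.1%} lower")  # latency: 38.8% lower
```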

4. Discrete Diffusion and Triplet-Block RWKV

The B³D-RWKV variant generalizes Diffusion-RWKV to blockwise discrete diffusion over tokens for tasks such as language modeling (Lin et al., 25 May 2026). Its key elements are:

Triplet-Block Layout: Each logical block of length $B$ is represented as three physical blocks: two masked (b₁, b₂), one ground-truth (b₃). Causality is preserved by RWKV’s unidirectional recurrence, while bidirectional context is injected via interleaving and masking patterns.
Loss Functions: A cross-entropy loss is applied to supervised positions in b₂, supplemented by a CAP (Confidence-Aware Parallel) sharpening loss to enforce confident distributional predictions.
Parallel Denoising: During inference, masked tokens are filled in parallel via blockwise iterative denoising, trading off speed and accuracy through thresholding and minimal commitments.
Complexity: The inference regime is strictly O(L) per sequence, enabling 1.6× mean decoding speedup over vanilla RWKV-7 ($\approx 222$ tok/s vs. 138 tok/s at equal quality), contrasting favorably with the O(L²) cost of Transformer-based diffusion models.
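The parallel-denoising item can be sketched as a confidence-thresholded unmasking loop over one block. This is a hedged illustration, not the B³D-RWKV procedure itself: `predict` is a hypothetical model call, `toy_predict` is a stand-in whose confidence rule is invented for the example, and the exact thresholding and commitment policy of the paper may differ.

```python
MASK = None  # sentinel for a still-masked position

def denoise_block(block, predict, threshold=0.9):
    """Iteratively fill the masked positions of one block.

    `predict(block)` returns a (token, confidence) pair for every position.
    Each step commits all masked positions whose confidence reaches
    `threshold`, and always commits at least the single most confident
    one so that the loop terminates.
    """
    block = list(block)
    steps = 0
    while any(tok is MASK for tok in block):
        preds = predict(block)
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        commit = [i for i in masked if preds[i][1] >= threshold]
        if not commit:  # minimal commitment: take the best candidate
            commit = [max(masked, key=lambda i: preds[i][1])]
        for i in commit:
            block[i] = preds[i][0]
        steps += 1
    return block, steps

def toy_predict(block):
    """Stand-in model: confidence grows with the number of filled neighbours."""
    target = ["a", "b", "c", "d"]
    preds = []
    for i in range(len(block)):
        neighbours = [block[j] for j in (i - 1, i + 1) if 0 <= j < len(block)]
        filled = sum(n is not MASK for n in neighbours)
        preds.append((target[i], 0.5 + 0.25 * filled))
    return preds

start = ["a", MASK, MASK, MASK]
careful, careful_steps = denoise_block(start, toy_predict, threshold=0.7)
greedy, greedy_steps = denoise_block(start, toy_predict, threshold=0.5)
print(careful, careful_steps)  # ['a', 'b', 'c', 'd'] 3
print(greedy, greedy_steps)    # ['a', 'b', 'c', 'd'] 1
```

Lowering the threshold commits more tokens per step and finishes in fewer model calls; with a real model this trades accuracy for speed, whereas the toy model here always predicts the right token.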

5. Computational Complexity and Scaling

A consistent theme across the Diffusion-RWKV variants is linear scaling with sequence or spatial extent. The core comparison is:

Model	Attention Complexity	Memory Growth	FLOPs (ImageNet 256×256)
Transformer (DiT-XL/2)	$O(L^2)$	[…]	[…] GFLOPs
Diffusion-RWKV (DIR-7-H)	$O(L)$	Slight, linear	[…] GFLOPs

For language, B³D-RWKV multiplies training sequence length by 3 for triplets, but inference does not incur quadratic attention cost and maintains full parallelism within each block (Lin et al., 25 May 2026).
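A back-of-the-envelope sketch of the scaling argument is given below. It counts only the token-mixing term and ignores constant factors, so the ratios are indicative rather than measured; the channel width, grid sizes, and function names are illustrative assumptions, not values from the cited papers.

```python
def num_tokens(height, width, patch):
    """Sequence length after splitting an H x W grid into patch x patch tiles."""
    return (height // patch) * (width // patch)

def attention_cost(tokens, dim):
    """Pairwise token mixing: proportional to tokens^2 * dim."""
    return tokens * tokens * dim

def wkv_cost(tokens, dim):
    """Recurrent token mixing: proportional to tokens * dim."""
    return tokens * dim

dim = 1024  # illustrative channel width
for side in (32, 64, 128):
    n = num_tokens(side, side, patch=2)
    ratio = attention_cost(n, dim) // wkv_cost(n, dim)
    print(f"{side}x{side} grid -> {n} tokens, attention/WKV cost ratio = {ratio}")

# Tripling the training sequence (triplet layout) triples the linear cost,
# whereas a quadratic mixer would pay nine times as much.
n = num_tokens(32, 32, patch=2)
print(wkv_cost(3 * n, dim) // wkv_cost(n, dim))              # 3
print(attention_cost(3 * n, dim) // attention_cost(n, dim))  # 9
```

Doubling the grid side quadruples the token count, so the quadratic term grows 16-fold while the linear term grows 4-fold.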

6. Empirical Performance

Across both vision and language tasks:

Vision (ImageNet 256/512×512, LAION-5B): Diffusion-RWKV matches or exceeds FID and sFID of CNN and Transformer alternatives at markedly lower FLOPs. Example: DRWKV-H/2 attains FID 2.16, sFID 4.58 on ImageNet 256 at 160 GFLOPs, surpassing DiT-XL/2 at 213 GFLOPs (Fei et al., 2024). DIR-7-H achieves FID 2.88, CLIP 0.33, and visually competitive human evaluation scores.
Language (B³D-RWKV): Performance is on par with strong causal RWKV and blockwise diffusion baselines on standard reasoning (MMLU, ARC-Challenge, ARC-Easy, PIQA, RACE), math (GSM8K, MATH), and generalization tasks. Throughput improvements are empirically validated (Lin et al., 25 May 2026).
Ablations: Removal of LoRA gating or norm layers in the CrossWKV module degrades FID and CLIP alignment, confirming the necessity of the LoRA-enhanced gates.

7. Limitations and Research Frontiers

Although Diffusion-RWKV architectures eliminate the spatial aggregation bottleneck and facilitate high-resolution synthesis, there are known limitations:

Slight deficit in ultimate FID/IS when compared to heavily-optimized Transformer variants (e.g., SiT).
The architecture remains orthogonal to advanced vision techniques such as multi-scale skip U-Nets or adaptive sampling, so combining it with them is still unexplored.
Discrete diffusion variants require 3× training sequence length and their efficacy hinges on robust triplet-block masking; early-commitment heuristics are essential for practical inference speed (Lin et al., 25 May 2026).
Future avenues include hybridizing RWKV and Transformer blocks, improved samplers (DPM solvers), quantization/compression of the WKV kernel, and extension to multimodal and video domains (Fei et al., 2024).

Diffusion-RWKV frameworks unify O(L) RNN-style backbones with blockwise, parallel denoising and expressive cross-modal integration, offering an efficient alternative for high-resolution generation and sequence modeling with transparent scaling properties (Fei et al., 2024, Xiao et al., 19 Apr 2025, Lin et al., 25 May 2026).

References (3)

Cross-attention for State-based model RWKV-7 (2025)

Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models (2024)

Triplet-Block Diffusion RWKV (2026)
