Dual-path Self-Attention RNN

Updated 14 February 2026
  • The paper introduces a dual-path framework that integrates self-attention with recurrent modules to efficiently capture both short- and long-range dependencies.
  • It employs a hierarchical, chunked processing approach—using intra-chunk and inter-chunk stages—to optimize computational efficiency and reduce latency in sequence modeling.
  • Empirical results show that DP-SARNN achieves state-of-the-art performance in real-time speech enhancement and audio-visual fusion, with marked improvements in STOI and PESQ metrics.

A Dual-path Self-Attention Recurrent Neural Network (DP-SARNN) is a neural architecture that introduces self-attention mechanisms into the dual-path recurrent neural network framework for efficient modeling of both local and long-range dependencies in sequential data, with particular utility in real-time time-domain speech enhancement and audio-visual speech extraction. The architecture interleaves recurrent and attention-based modules, operating along intra-chunk (local) and inter-chunk (global) axes, achieving state-of-the-art performance with reduced computational overhead and algorithmic latency. DP-SARNN represents key advances over both pure RNN-based and transformer-style dual-path architectures for sequence processing in audio and multimodal fusion frameworks (Pandey et al., 2020, Xu et al., 2022).

1. Principle of Dual-Path Chunked Processing

DP-SARNN operates on chunked input representations, segmenting the time sequence into overlapping chunks so that short-range and long-range temporal dependencies can be modeled separately. Let the input $X \in \mathbb{R}^{T \times L}$ (with $T$ frames and $L$-dimensional features) be divided into $J$ overlapping segments (chunks) of length $K$ with stride $P$, producing

$$\mathbf{X} \in \mathbb{R}^{J \times K \times L}, \qquad \mathbf{X}_{j,k} = X_{(j-1)P + k,\,:}$$

Within each DP-SARNN block, operations are applied in two stages:

  1. Intra-chunk: Each chunk (of $K$ frames) is processed independently along the time axis to model fine-grained, local context.
  2. Inter-chunk: At each frame position $k$, features across all chunks are processed to learn global or long-range dependencies.

This decomposition yields a hierarchical processing flow, critical for both memory and computational efficiency in long sequential data (Pandey et al., 2020).
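The chunking and its overlap-add (OLA) inverse can be sketched in NumPy as follows. The helper names (`chunk_sequence`, `overlap_add`), the zero-padding of the final chunk, and the averaging used to normalize overlapping regions are illustrative assumptions, not details specified by the paper:

```python
import numpy as np

def chunk_sequence(x: np.ndarray, K: int, P: int) -> np.ndarray:
    """Segment a (T, L) feature sequence into J overlapping chunks of
    length K with stride P, zero-padding the tail so every chunk is full.
    Matches X[j, k] = x[(j-1)*P + k] from the text (0-indexed here)."""
    T, L = x.shape
    J = max(1, int(np.ceil((T - K) / P)) + 1)   # chunks needed to cover T frames
    padded = np.zeros(((J - 1) * P + K, L), dtype=x.dtype)
    padded[:T] = x
    return np.stack([padded[j * P : j * P + K] for j in range(J)])

def overlap_add(chunks: np.ndarray, P: int, T: int) -> np.ndarray:
    """Invert chunking by summing overlapping chunks and averaging by the
    overlap count (normalization policy is an assumption)."""
    J, K, L = chunks.shape
    out = np.zeros(((J - 1) * P + K, L), dtype=chunks.dtype)
    cnt = np.zeros(((J - 1) * P + K, 1))
    for j in range(J):
        out[j * P : j * P + K] += chunks[j]
        cnt[j * P : j * P + K] += 1
    return (out / cnt)[:T]

# Example: 10 frames of 4-dim features, chunk length K=4, half-overlap P=2.
x = np.arange(40, dtype=np.float64).reshape(10, 4)
chunks = chunk_sequence(x, K=4, P=2)
print(chunks.shape)  # (4, 4, 4)
```

Round-tripping through `overlap_add(chunks, P=2, T=10)` recovers `x` exactly, since the overlapping copies are identical and averaged.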

2. Self-Attention RNN (SARNN) Module

A central innovation in DP-SARNN is replacing conventional RNNs in both intra- and inter-chunk modules with Self-Attention RNN (SARNN) blocks. Each SARNN block employs the following computation sequence:

  • LayerNorm
  • BLSTM (or a unidirectional LSTM in causal settings)
  • Linear projection
  • LayerNorm
  • Single-headed efficient gated scaled dot-product attention
  • Add & Norm
  • Two-layer MLP with GELU activation (channel size $4N$)
  • Add & Norm

The self-attention mechanism within each SARNN block is a gated scaled dot-product attention,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

in which trainable gating vectors (broadcast across the time axis) modulate the attention inputs elementwise.

For causal, real-time settings, the attention logits for future frames are masked so that frame $t$ attends only to frames $t' \le t$ (attention weights for $t' > t$ are forced to zero). This ensures strict online processing for the inter-chunk SARNN (Pandey et al., 2020).
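A minimal NumPy sketch of single-head gated attention with a causal mask follows. Applying the gating vectors to the query/key projections is one plausible reading of the gating described above, and all names here are illustrative:

```python
import numpy as np

def gated_causal_attention(x, Wq, Wk, Wv, gq, gk, causal=True):
    """Single-head scaled dot-product attention over a (T, N) sequence.
    gq, gk are trainable gating vectors of shape (d,), broadcast across
    time and applied elementwise to the query/key projections
    (an assumption about where the gates act)."""
    T, N = x.shape
    q = (x @ Wq) * gq                      # gated queries, (T, d)
    k = (x @ Wk) * gk                      # gated keys,    (T, d)
    v = x @ Wv                             # values,        (T, d)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # (T, T) attention logits
    if causal:
        # Mask future frames: frame t may attend only to frames <= t.
        logits = np.where(np.tril(np.ones((T, T), bool)), logits, -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
T, N, d = 6, 8, 8
x = rng.standard_normal((T, N))
Wq, Wk, Wv = (rng.standard_normal((N, d)) * 0.1 for _ in range(3))
gq, gk = np.ones(d), np.ones(d)            # identity gates for the demo
out = gated_causal_attention(x, Wq, Wk, Wv, gq, gk)
print(out.shape)  # (6, 8)
```

Because of the causal mask, the first output frame depends only on the first input frame, which is what makes the inter-chunk stage usable online.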

3. End-to-End Pipeline and Data Flow

The stacked DP-SARNN blocks process the chunked representation as follows:

  1. Apply the intra-chunk SARNN to each chunk independently (treating the $J$ chunks as a batch and processing along the $K$-frame axis).
  2. Transpose the result from $\mathbb{R}^{J \times K \times N}$ to $\mathbb{R}^{K \times J \times N}$.
  3. Apply the inter-chunk SARNN at each of the $K$ frame positions, processing across the $J$ chunks.
  4. Transpose back to $\mathbb{R}^{J \times K \times N}$.

After stacking several such blocks and a linear output layer, overlap-add (OLA) reconstruction is performed at both frame and chunk level to recover the enhanced time-domain waveform.
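The alternating intra/inter data flow reduces to a pair of transposes around per-axis sequence models. In this sketch the SARNN modules are stood in for by arbitrary callables (identity functions in the demo), so only the dual-path routing itself is shown:

```python
import numpy as np

def dual_path_block(x, intra_fn, inter_fn):
    """x: (J, K, N) chunked features. intra_fn/inter_fn are stand-ins for
    the intra- and inter-chunk SARNN modules, each mapping a (len, N)
    sequence to a (len, N) sequence."""
    # Intra-chunk: process each chunk along its K frames (J as batch).
    x = np.stack([intra_fn(chunk) for chunk in x])          # (J, K, N)
    # Inter-chunk: process across the J chunks at each frame position k.
    x = x.transpose(1, 0, 2)                                # (K, J, N)
    x = np.stack([inter_fn(pos) for pos in x])              # (K, J, N)
    return x.transpose(1, 0, 2)                             # (J, K, N)

identity = lambda seq: seq
x = np.random.default_rng(1).standard_normal((4, 16, 8))    # J=4, K=16, N=8
y = dual_path_block(x, identity, identity)
print(y.shape)  # (4, 16, 8)
```

With identity stand-ins the block is a no-op, which makes the shape bookkeeping easy to verify before plugging in real recurrent/attention modules.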

4. Architectural Hyperparameters and Implementation

Key configuration for real-time speech enhancement includes:

  • Input projection: TT7 channels
  • Six DP-SARNN blocks with (B)LSTM layers of hidden size TT8
  • Single-head attention, FFN size TT9, dropout rate LL0
  • Frame size LL1 samples (1 ms) or LL2 ms (real-time), frame shift LL3 or LL4 ms
  • Chunk size LL5 frames (LL632 ms), shift LL7 (half-overlap)
  • End-to-end latency LL832 ms; per 32 ms chunk CPU time: 7.9 ms
  • Adam optimizer, initial learning rate LL9, PCM loss, gradient norm clip at 3, mixed-precision training (Pandey et al., 2020)

5. Audio-Visual Extension and Dual-Path Cross-Modal Attention

In the audio-visual speech extraction context, DP-SARNN is extended to support multimodal fusion (Xu et al., 2022):

  • Audio features from a time-domain convolutional encoder are chunked as before.
  • Visual features are extracted per video frame (MTCNN + FaceNet), dimension JJ0, with natural alignment to audio chunks (JJ1), so up/downsampling is unnecessary.
  • "Dual-Path Attention" blocks stack JJ2 times, with each block producing updated audio (JJ3) and video (JJ4) representations through intra-chunk, inter-chunk, and cross-modal attention.

The inter-chunk block conducts:

  • Audio self-attention across chunks for each position.
  • Video self-attention across visual frames.
  • Cross-modal attention: audio features are pooled within each chunk, then attended to by the video stream and vice versa; the fused outputs are added back to their respective streams and linearly projected.
  • All Q/K/V projections are linear layers; no explicit positional encoding.
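A minimal sketch of the cross-modal step, assuming mean pooling over each chunk, one video frame aligned per audio chunk, and omitting the linear projections for brevity (all of these simplifications are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, kv, d):
    """Plain single-head attention: queries q (Tq, d) over keys/values kv (Tk, d)."""
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def cross_modal_fuse(audio, video):
    """audio: (J, K, d) chunked audio features; video: (J, d) per-chunk
    visual features. Audio is mean-pooled within each chunk, the pooled
    summaries and the video frames cross-attend, and the fused outputs
    are added back residually."""
    J, K, d = audio.shape
    pooled = audio.mean(axis=1)                   # (J, d) per-chunk audio summary
    video_out = video + attend(video, pooled, d)  # video attends to pooled audio
    audio_ctx = attend(pooled, video, d)          # pooled audio attends to video
    audio_out = audio + audio_ctx[:, None, :]     # broadcast back over K frames
    return audio_out, video_out

rng = np.random.default_rng(2)
a, v = rng.standard_normal((4, 16, 8)), rng.standard_normal((4, 8))
ao, vo = cross_modal_fuse(a, v)
print(ao.shape, vo.shape)  # (4, 16, 8) (4, 8)
```

The residual additions and per-chunk alignment mirror the fused-outputs-added-back design described above; real implementations would insert the Q/K/V linear projections and LayerNorm at each step.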

Key hyperparameters for AV extraction:

  • 3 dual-path blocks; 4 intra- and inter-chunk layers per block
  • Audio chunk size JJ5 (16 ms window, 8 ms hop)
  • Audio JJ6, video JJ7, MHA with JJ8, JJ9, FFN KK0
  • No dropout; regularization via residuals, LayerNorm, local attention masking (Xu et al., 2022)

6. Performance, Efficiency, and Empirical Results

DP-SARNN demonstrates the following empirical results:

  • On noisy WSJ0 (babble, cafeteria, SNR KK1 to KK2 dB): STOI KK3, PESQ KK4 (non-causal), outperforming DP-RNN by KK5 STOI and KK6 PESQ.
  • Causal DP-SARNN (DP-SALSTM): STOI KK7, PESQ KK8, CPU runtime per 32 ms chunk 7.9 ms, model size 6.49M params
  • Ablations: attention enables 4× larger frame shift (reducing compute) with no loss; removing attention degrades PESQ by KK9; larger chunk shift provides balance between context and latency.
  • In audio-visual extraction, DP-SARNN-based cross-modal fusion yields PP0 dB SI-SNR improvement over ConvTasNet and AV-ConvTasNet, especially in multi-interferer mixtures (Pandey et al., 2020, Xu et al., 2022).

DP-SARNN generalizes prior sequential dual-path approaches (DPRNN) by augmenting both intra- and inter-chunk recurrence with attention-based modeling, or fully replacing RNNs with transformer blocks in some variants, as in audio-visual fusion work. In multichannel enhancement, a third spatial path can be added as in TPARN, where each channel is processed independently by a dual-path (self-attentive) recurrent network, then aggregated via an additional spatial path (Pandey et al., 2021). A plausible implication is that DP-SARNN and its extensions provide a versatile backbone for various monaural, multichannel, and multimodal sequence enhancement and separation tasks. The dual-path chunked structure is also widely adopted in high-performance, low-latency, and real-time sequence models across speech and broader sequential domains.
