Dual-path Self-Attention RNN

Updated 14 February 2026
  • The paper introduces a dual-path framework that integrates self-attention with recurrent modules to efficiently capture both short- and long-range dependencies.
  • It employs a hierarchical, chunked processing approach—using intra-chunk and inter-chunk stages—to optimize computational efficiency and reduce latency in sequence modeling.
  • Empirical results show that DP-SARNN achieves state-of-the-art performance in real-time speech enhancement and audio-visual fusion, with marked improvements in STOI and PESQ metrics.

A Dual-path Self-Attention Recurrent Neural Network (DP-SARNN) is a neural architecture that introduces self-attention mechanisms into the dual-path recurrent neural network framework for efficient modeling of both local and long-range dependencies in sequential data, with particular utility in real-time time-domain speech enhancement and audio-visual speech extraction. The architecture interleaves recurrent and attention-based modules, operating along intra-chunk (local) and inter-chunk (global) axes, achieving state-of-the-art performance with reduced computational overhead and algorithmic latency. DP-SARNN represents key advances over both pure RNN-based and transformer-style dual-path architectures for sequence processing in audio and multimodal fusion frameworks (Pandey et al., 2020, Xu et al., 2022).

1. Principle of Dual-Path Chunked Processing

DP-SARNN operates on chunked input representations, segmenting the time sequence into overlapping chunks so that short-range and long-range temporal dependencies can be modeled separately. Let the input $X \in \mathbb{R}^{T \times L}$ (with $T$ frames and $L$-dimensional features) be divided into $J$ overlapping segments (chunks) of length $K$ with stride $P$, producing

$$\mathbf{X} \in \mathbb{R}^{J \times K \times L}, \qquad \mathbf{X}_{j,k} = X_{(j-1)P + k,\,:}$$

Within each DP-SARNN block, operations are applied in two stages:

  1. Intra-chunk: Each chunk (of $K$ frames) is processed independently along the time axis to model fine-grained, local context.
  2. Inter-chunk: At each frame position $k$, features across all chunks are processed to learn global or long-range dependencies.

This decomposition yields a hierarchical processing flow, critical for both memory and computational efficiency in long sequential data (Pandey et al., 2020).
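The chunking and its overlap-add (OLA) inverse can be sketched in NumPy as follows. The helper names (`chunk_sequence`, `overlap_add`), the zero-padding of the final chunk, and the averaging used to normalize overlapping regions are illustrative assumptions, not details specified by the paper:

```python
import numpy as np

def chunk_sequence(x: np.ndarray, K: int, P: int) -> np.ndarray:
    """Segment a (T, L) feature sequence into J overlapping chunks of
    length K with stride P, zero-padding the tail so every chunk is full.
    Matches X[j, k] = x[(j-1)*P + k] from the text (0-indexed here)."""
    T, L = x.shape
    J = max(1, int(np.ceil((T - K) / P)) + 1)   # chunks needed to cover T frames
    padded = np.zeros(((J - 1) * P + K, L), dtype=x.dtype)
    padded[:T] = x
    return np.stack([padded[j * P : j * P + K] for j in range(J)])

def overlap_add(chunks: np.ndarray, P: int, T: int) -> np.ndarray:
    """Invert chunking by summing overlapping chunks and averaging by the
    overlap count (normalization policy is an assumption)."""
    J, K, L = chunks.shape
    out = np.zeros(((J - 1) * P + K, L), dtype=chunks.dtype)
    cnt = np.zeros(((J - 1) * P + K, 1))
    for j in range(J):
        out[j * P : j * P + K] += chunks[j]
        cnt[j * P : j * P + K] += 1
    return (out / cnt)[:T]

# Example: 10 frames of 4-dim features, chunk length K=4, half-overlap P=2.
x = np.arange(40, dtype=np.float64).reshape(10, 4)
chunks = chunk_sequence(x, K=4, P=2)
print(chunks.shape)  # (4, 4, 4)
```

Round-tripping through `overlap_add(chunks, P=2, T=10)` recovers `x` exactly, since the overlapping copies are identical and averaged.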

2. Self-Attention RNN (SARNN) Module

A central innovation in DP-SARNN is replacing conventional RNNs in both intra- and inter-chunk modules with Self-Attention RNN (SARNN) blocks. Each SARNN block employs the following computation sequence:

  • LayerNorm
  • BLSTM (or a unidirectional LSTM in causal settings)
  • Linear projection
  • LayerNorm
  • Single-headed efficient gated scaled dot-product attention
  • Add & Norm
  • Two-layer MLP with GELU activation (channel size $4N$)
  • Add & Norm

The self-attention mechanism within each SARNN block is a gated scaled dot-product attention,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

in which trainable gating vectors (broadcast across the time axis) modulate the attention inputs elementwise.

For causal, real-time settings, the attention logits for future frames are masked so that frame $t$ attends only to frames $t' \le t$ (attention weights for $t' > t$ are forced to zero). This ensures strict online processing for the inter-chunk SARNN (Pandey et al., 2020).
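A minimal NumPy sketch of single-head gated attention with a causal mask follows. Applying the gating vectors to the query/key projections is one plausible reading of the gating described above, and all names here are illustrative:

```python
import numpy as np

def gated_causal_attention(x, Wq, Wk, Wv, gq, gk, causal=True):
    """Single-head scaled dot-product attention over a (T, N) sequence.
    gq, gk are trainable gating vectors of shape (d,), broadcast across
    time and applied elementwise to the query/key projections
    (an assumption about where the gates act)."""
    T, N = x.shape
    q = (x @ Wq) * gq                      # gated queries, (T, d)
    k = (x @ Wk) * gk                      # gated keys,    (T, d)
    v = x @ Wv                             # values,        (T, d)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # (T, T) attention logits
    if causal:
        # Mask future frames: frame t may attend only to frames <= t.
        logits = np.where(np.tril(np.ones((T, T), bool)), logits, -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
T, N, d = 6, 8, 8
x = rng.standard_normal((T, N))
Wq, Wk, Wv = (rng.standard_normal((N, d)) * 0.1 for _ in range(3))
gq, gk = np.ones(d), np.ones(d)            # identity gates for the demo
out = gated_causal_attention(x, Wq, Wk, Wv, gq, gk)
print(out.shape)  # (6, 8)
```

Because of the causal mask, the first output frame depends only on the first input frame, which is what makes the inter-chunk stage usable online.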

3. End-to-End Pipeline and Data Flow

The stacked DP-SARNN blocks process the chunked representation as follows:

  1. Apply the intra-chunk SARNN to each chunk independently (treating the $J$ chunks as a batch and processing along the $K$-frame axis).
  2. Transpose the result from $\mathbb{R}^{J \times K \times N}$ to $\mathbb{R}^{K \times J \times N}$.
  3. Apply the inter-chunk SARNN at each of the $K$ frame positions, processing across the $J$ chunks.
  4. Transpose back to $\mathbb{R}^{J \times K \times N}$.

After stacking several such blocks and a linear output layer, overlap-add (OLA) reconstruction is performed at both frame and chunk level to recover the enhanced time-domain waveform.
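The alternating intra/inter data flow reduces to a pair of transposes around per-axis sequence models. In this sketch the SARNN modules are stood in for by arbitrary callables (identity functions in the demo), so only the dual-path routing itself is shown:

```python
import numpy as np

def dual_path_block(x, intra_fn, inter_fn):
    """x: (J, K, N) chunked features. intra_fn/inter_fn are stand-ins for
    the intra- and inter-chunk SARNN modules, each mapping a (len, N)
    sequence to a (len, N) sequence."""
    # Intra-chunk: process each chunk along its K frames (J as batch).
    x = np.stack([intra_fn(chunk) for chunk in x])          # (J, K, N)
    # Inter-chunk: process across the J chunks at each frame position k.
    x = x.transpose(1, 0, 2)                                # (K, J, N)
    x = np.stack([inter_fn(pos) for pos in x])              # (K, J, N)
    return x.transpose(1, 0, 2)                             # (J, K, N)

identity = lambda seq: seq
x = np.random.default_rng(1).standard_normal((4, 16, 8))    # J=4, K=16, N=8
y = dual_path_block(x, identity, identity)
print(y.shape)  # (4, 16, 8)
```

With identity stand-ins the block is a no-op, which makes the shape bookkeeping easy to verify before plugging in real recurrent/attention modules.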

4. Architectural Hyperparameters and Implementation

Key configuration for real-time speech enhancement includes:

  • Input projection: TT7 channels
  • Six DP-SARNN blocks with (B)LSTM layers of hidden size TT8
  • Single-head attention, FFN size TT9, dropout rate LL0
  • Frame size LL1 samples (1 ms) or LL2 ms (real-time), frame shift LL3 or LL4 ms
  • Chunk size LL5 frames (LL632 ms), shift LL7 (half-overlap)
  • End-to-end latency LL832 ms; per 32 ms chunk CPU time: 7.9 ms
  • Adam optimizer, initial learning rate LL9, PCM loss, gradient norm clip at 3, mixed-precision training (Pandey et al., 2020)

5. Audio-Visual Extension and Dual-Path Cross-Modal Attention

In the audio-visual speech extraction context, DP-SARNN is extended to support multimodal fusion (Xu et al., 2022):

  • Audio features from a time-domain convolutional encoder are chunked as before.
  • Visual features are extracted per video frame (MTCNN + FaceNet), dimension JJ0, with natural alignment to audio chunks (JJ1), so up/downsampling is unnecessary.
  • "Dual-Path Attention" blocks stack JJ2 times, with each block producing updated audio (JJ3) and video (JJ4) representations through intra-chunk, inter-chunk, and cross-modal attention.

The inter-chunk block conducts:

  • Audio self-attention across chunks for each position.
  • Video self-attention across visual frames.
  • Cross-modal attention: audio features are pooled within each chunk, then attended to by the video stream and vice versa; the fused outputs are added back to their respective streams and linearly projected.
  • All Q/K/V projections are linear layers; no explicit positional encoding.
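A minimal sketch of the cross-modal step, assuming mean pooling over each chunk, one video frame aligned per audio chunk, and omitting the linear projections for brevity (all of these simplifications are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, kv, d):
    """Plain single-head attention: queries q (Tq, d) over keys/values kv (Tk, d)."""
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def cross_modal_fuse(audio, video):
    """audio: (J, K, d) chunked audio features; video: (J, d) per-chunk
    visual features. Audio is mean-pooled within each chunk, the pooled
    summaries and the video frames cross-attend, and the fused outputs
    are added back residually."""
    J, K, d = audio.shape
    pooled = audio.mean(axis=1)                   # (J, d) per-chunk audio summary
    video_out = video + attend(video, pooled, d)  # video attends to pooled audio
    audio_ctx = attend(pooled, video, d)          # pooled audio attends to video
    audio_out = audio + audio_ctx[:, None, :]     # broadcast back over K frames
    return audio_out, video_out

rng = np.random.default_rng(2)
a, v = rng.standard_normal((4, 16, 8)), rng.standard_normal((4, 8))
ao, vo = cross_modal_fuse(a, v)
print(ao.shape, vo.shape)  # (4, 16, 8) (4, 8)
```

The residual additions and per-chunk alignment mirror the fused-outputs-added-back design described above; real implementations would insert the Q/K/V linear projections and LayerNorm at each step.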

Key hyperparameters for AV extraction:

  • 3 dual-path blocks; 4 intra- and inter-chunk layers per block
  • Audio chunk size JJ5 (16 ms window, 8 ms hop)
  • Audio JJ6, video JJ7, MHA with JJ8, JJ9, FFN KK0
  • No dropout; regularization via residuals, LayerNorm, local attention masking (Xu et al., 2022)

6. Performance, Efficiency, and Empirical Results

DP-SARNN demonstrates the following empirical results:

  • On noisy WSJ0 (babble, cafeteria, SNR KK1 to KK2 dB): STOI KK3, PESQ KK4 (non-causal), outperforming DP-RNN by KK5 STOI and KK6 PESQ.
  • Causal DP-SARNN (DP-SALSTM): STOI KK7, PESQ KK8, CPU runtime per 32 ms chunk 7.9 ms, model size 6.49M params
  • Ablations: attention enables 4× larger frame shift (reducing compute) with no loss; removing attention degrades PESQ by KK9; larger chunk shift provides balance between context and latency.
  • In audio-visual extraction, DP-SARNN-based cross-modal fusion yields PP0 dB SI-SNR improvement over ConvTasNet and AV-ConvTasNet, especially in multi-interferer mixtures (Pandey et al., 2020, Xu et al., 2022).

DP-SARNN generalizes prior sequential dual-path approaches (DPRNN) by augmenting both intra- and inter-chunk recurrence with attention-based modeling, or fully replacing RNNs with transformer blocks in some variants, as in audio-visual fusion work. In multichannel enhancement, a third spatial path can be added as in TPARN, where each channel is processed independently by a dual-path (self-attentive) recurrent network, then aggregated via an additional spatial path (Pandey et al., 2021). A plausible implication is that DP-SARNN and its extensions provide a versatile backbone for various monaural, multichannel, and multimodal sequence enhancement and separation tasks. The dual-path chunked structure is also widely adopted in high-performance, low-latency, and real-time sequence models across speech and broader sequential domains.
