
SE-Res2Net Block: Multi-Scale Channel Attention

Updated 28 March 2026
  • SE-Res2Net Block is a neural module that merges multi-scale feature extraction via hierarchical convolutions with SE-based channel recalibration.
  • It uses fine-grained channel splitting and recursive 3x3 convolutions to emulate diverse receptive fields for both local and global feature capture.
  • Empirical results confirm its parameter efficiency and improved robustness in detecting replay and synthetic speech attacks in audio forensics.

The SE-Res2Net block is a neural building block that combines fine-grained multi-scale feature extraction (Res2Net) with adaptive channel recalibration (Squeeze-and-Excitation, SE) within a residual learning framework. It has shown strong empirical performance and parameter efficiency on voice anti-spoofing and related audio forensics benchmarks, particularly for detecting replay and synthetic speech attacks (Li et al., 2020, Xue et al., 2021, Wang et al., 2022).

1. Architectural Overview

The SE-Res2Net block extends the traditional ResNet bottleneck design through two major modifications: multi-scale residual learning across channel groups and explicit channel-wise gating. For an input tensor $X \in \mathbb{R}^{C \times H \times W}$:

  • Multi-scale decomposition: The input is first projected by a $1 \times 1$ convolution to a channel size $D$ and split into $s$ equally sized subgroups $\{x_i\}_{i=1}^s$ along the channel dimension.
  • Hierarchical residual convolutions: Each $x_i$ is added to the output of the previous path and transformed by its own $3 \times 3$ convolution $K_i$, giving $y_1 = K_1(x_1)$ and $y_i = K_i(x_i + y_{i-1})$ for $i = 2, \dots, s$.
  • Re-aggregation: The $s$ output streams are concatenated and projected back to $C$ channels via another $1 \times 1$ convolution to yield $Y$.
  • SE recalibration: The SE module is applied to $Y$, computing channel attention weights from global average pooling followed by a two-layer MLP; $Y$ is then multiplied channel-wise by these weights to give the recalibrated output $S$.
  • Residual addition: The block output is the sum $X + S$ (plus optional normalization and/or non-linearity).

Explicit formulaic description (Li et al., 2020, Xue et al., 2021):

\begin{align*}
U &= \mathrm{Conv}_{1 \times 1}(X) \\
x_i &= \mathrm{Slice}_i(U), \quad i = 1, \dots, s \\
y_1 &= K_1(x_1) \\
y_i &= K_i(x_i + y_{i-1}), \quad i = 2, \dots, s \\
Y' &= \mathrm{Concat}(y_1, \dots, y_s) \\
Y &= \mathrm{Conv}_{1 \times 1}(Y') \\
S &= \mathrm{SE}(Y) \\
\mathrm{Output} &= X + S
\end{align*}

2. Functional Components

Res2Net Multi-Scale Convolution

By splitting the channel space and deploying hierarchical residual $3 \times 3$ convolutions per group, the block realizes diverse effective receptive fields in parallel. For scale factor $s$, the $i$th group’s convolutional path covers up to $i$ cascaded $3 \times 3$ convolutions, emulating receptive fields of size $3 \times 3$, $5 \times 5$, up to $(2s+1) \times (2s+1)$ within a single block (Li et al., 2020, Xue et al., 2021, Wang et al., 2022). This design enables fine-grained feature extraction across both local and more global contexts.
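The receptive-field arithmetic can be checked with a few lines of Python. This is a minimal sketch, not code from the cited papers: the function name `receptive_fields` and the flag `first_group_conv` are illustrative assumptions. With $y_1 = K_1(x_1)$, as in the formulation used here, group $i$ stacks up to $i$ convolutions and reaches a side length of $2i + 1$. In variants where the first group is passed through unchanged (as in the original Res2Net design), every size shifts down by one convolution and the largest becomes $2s - 1$.

```python
def receptive_fields(s, kernel=3, first_group_conv=True):
    """Largest receptive-field side length reachable by each group's output.

    Assumes stride-1, undilated convolutions. Group i sees x_1 through the
    longest chain: i stacked convolutions if the first group is convolved,
    i - 1 if it is passed through unchanged.
    """
    if s < 1:
        raise ValueError("s must be a positive integer")
    offset = 0 if first_group_conv else 1
    # Each stacked k x k convolution widens the field by (k - 1).
    return [1 + (i - offset) * (kernel - 1) for i in range(1, s + 1)]


print(receptive_fields(4))                          # [3, 5, 7, 9]
print(receptive_fields(4, first_group_conv=False))  # [1, 3, 5, 7]
```

The numbers are upper bounds on the theoretical receptive field of each group; the effective receptive field learned in practice can be smaller.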

Squeeze-and-Excitation (SE) Module

The SE unit performs global-average “squeeze” pooling along spatial dimensions to compute a channel descriptor $z \in \mathbb{R}^C$. This is followed by two linear transformations with a reduction ratio $r$ (typically 8 or 16), first compressing to $C/r$ and then expanding back, interleaved with ReLU and sigmoid activations: $a = \mathrm{ReLU}(W_1 z), \quad e = \sigma(W_2 a)$, where $e \in (0,1)^C$ serves as per-channel attention coefficients. Each feature channel is scaled by the corresponding entry in $e$ before the residual addition (Li et al., 2020, Xue et al., 2021, Wang et al., 2022).
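A toy example makes the squeeze and excitation steps concrete. The sketch below uses plain Python lists and hand-picked weights; the helper names (`squeeze`, `excite`, `recalibrate`) and all numeric values are assumptions chosen for illustration, and biases are omitted to match the formulas in the text.

```python
import math


def squeeze(y):
    """Global average pooling. y is a list of C channels, each an H x W nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in y]


def excite(z, w1, w2):
    """Gate e = sigmoid(W2 @ relu(W1 @ z)), with W1 of shape (C/r, C) and W2 of shape (C, C/r)."""
    a = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * ac for w, ac in zip(row, a)))) for row in w2]


def recalibrate(y, e):
    """Scale every value of channel c by its gate e[c]."""
    return [[[e_c * v for v in row] for row in ch] for ch, e_c in zip(y, e)]


# Toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2 (hidden width 1).
y = [[[1.0, 3.0], [5.0, 7.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, -0.5]]      # (C/r, C) = (1, 2), illustrative values
w2 = [[1.0], [-1.0]]    # (C, C/r) = (2, 1), illustrative values

z = squeeze(y)          # [4.0, 2.0]
e = excite(z, w1, w2)   # roughly [0.731, 0.269]
s_out = recalibrate(y, e)
print(z, e)
```

Here the hidden activation is $a = \mathrm{ReLU}(0.5 \cdot 4 - 0.5 \cdot 2) = 1$, so the first channel is kept at about 73% of its magnitude and the second is attenuated to about 27%.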

3. Mathematical Formulation

Given input $X \in \mathbb{R}^{C \times H \times W}$, the process is as follows (for $s$ channel groups, reduction ratio $r$):

  1. Bottleneck projection:

    $U = \mathrm{Conv}_{1\times1}(X)$

  2. Split into channel groups:

    $U = [x_1, x_2, \dots, x_s], \quad x_i \in \mathbb{R}^{d \times H \times W}, \quad d = D/s$

  3. Hierarchical convolution:

    $y_1 = K_1(x_1), \quad y_i = K_i(x_i + y_{i-1}), \quad i = 2, \dots, s$

  4. Concatenate and fuse:

    $Y' = [y_1, \dots, y_s], \quad Y = \mathrm{Conv}_{1\times1}(Y')$

  5. SE recalibration:

    $z_c = \frac{1}{HW} \sum_{u=1}^{H} \sum_{v=1}^{W} Y_c(u, v)$
    $a = \mathrm{ReLU}(W_1 z), \quad e = \sigma(W_2 a)$
    $S_c(u, v) = e_c \cdot Y_c(u, v)$

  6. Residual addition:

    $\mathrm{Output} = X + S$
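To keep track of dimensions through the six steps, the following sketch records the shape of each intermediate quantity. It is a bookkeeping aid only: the function name `trace_shapes` and the example sizes are assumptions, and it presumes stride-1 convolutions with padding that preserves $H$ and $W$.

```python
def trace_shapes(C, D, s, r, H, W):
    """Shapes of the intermediate tensors, using the symbols from the steps above."""
    if D % s != 0:
        raise ValueError("D must be divisible by the scale factor s")
    if C % r != 0:
        raise ValueError("C must be divisible by the reduction ratio r")
    d = D // s
    return {
        "X": (C, H, W),        # block input
        "U": (D, H, W),        # after the 1x1 bottleneck projection
        "x_i": (d, H, W),      # each of the s channel groups
        "y_i": (d, H, W),      # each hierarchical 3x3 output
        "Y_prime": (D, H, W),  # concatenation of y_1 .. y_s
        "Y": (C, H, W),        # after the fusing 1x1 convolution
        "z": (C,),             # squeezed channel descriptor
        "a": (C // r,),        # hidden SE activation
        "e": (C,),             # per-channel gates
        "S": (C, H, W),        # recalibrated features
        "Output": (C, H, W),   # X + S
    }


for name, shape in trace_shapes(C=64, D=64, s=4, r=16, H=32, W=32).items():
    print(f"{name:8s} {shape}")
```

The residual sum $X + S$ is well defined because the fusing convolution restores the channel count to $C$.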

4. Empirical Behavior and Parameter Efficiency

Empirical comparisons on the ASVspoof 2019 corpus demonstrate that SE-Res2Net architectures can achieve improved generalizability and accuracy in both physical access (PA) and logical access (LA) spoofing detection scenarios (Li et al., 2020, Xue et al., 2021, Wang et al., 2022). The block provides these benefits with lower parameter counts than canonical ResNet variants:

Model          Parameters (M)   Comparative Size
ResNet34       1.33             Baseline
ResNet50       1.05             -21% vs ResNet34
Res2Net50      0.88             -16.2% vs ResNet50
SE-Res2Net50   0.92             +0.04M (SE overhead) vs Res2Net50

The SE module adds only a modest overhead (about 0.04M parameters) while improving performance by modeling channel interdependencies that the spatial convolutions alone do not capture (Li et al., 2020).
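The comparative-size column follows directly from the parameter counts. The short check below recomputes it from the table values; the variable and function names are illustrative, and nothing is used beyond the four reported counts.

```python
# Parameter counts in millions, copied from the table above.
params_m = {
    "ResNet34": 1.33,
    "ResNet50": 1.05,
    "Res2Net50": 0.88,
    "SE-Res2Net50": 0.92,
}


def relative_change(new, old):
    """Percentage change of `new` with respect to `old`."""
    return 100.0 * (new - old) / old


r50_vs_r34 = relative_change(params_m["ResNet50"], params_m["ResNet34"])
r2n_vs_r50 = relative_change(params_m["Res2Net50"], params_m["ResNet50"])
se_overhead = params_m["SE-Res2Net50"] - params_m["Res2Net50"]
se_vs_r34 = relative_change(params_m["SE-Res2Net50"], params_m["ResNet34"])

print(f"ResNet50 vs ResNet34:     {r50_vs_r34:+.1f}%")    # -21.1%
print(f"Res2Net50 vs ResNet50:    {r2n_vs_r50:+.1f}%")    # -16.2%
print(f"SE overhead:              {se_overhead:+.2f}M")   # +0.04M
print(f"SE-Res2Net50 vs ResNet34: {se_vs_r34:+.1f}%")     # -30.8%
```

The last line is a derived figure rather than one reported in the table: relative to the ResNet34 baseline, SE-Res2Net50 uses roughly 31% fewer parameters even after the SE overhead is included.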

5. Design and Implementation Considerations

Key architectural and implementation aspects include:

  • Scale factor $s$: Typically set between 2 and 8; increasing $s$ provides finer multi-scale granularity but incurs more $3 \times 3$ convolution paths.
  • SE reduction ratio $r$: Commonly set to 8 or 16, balancing adaptivity and parameter growth.
  • Normalization: Batch normalization follows each convolution; ReLU activations are applied after convolutional layers and after the final addition (Wang et al., 2022).
  • Residual matching: When input and output channel counts differ, the residual branch is projected with a $1 \times 1$ convolution plus batch normalization before addition to ensure dimensional consistency (Wang et al., 2022).
  • Initialization: A standard scheme such as He initialization is recommended for the SE MLP weights; the bias of the second SE FC layer is often initialized near zero (Wang et al., 2022).
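How $s$ and $r$ affect the weight budget can be estimated with simple counting. The sketch below is a rough estimate under stated assumptions: the bottleneck width $D$ is held fixed while $s$ varies, biases and batch-normalization parameters are ignored, and the function names are hypothetical. Real configurations often widen $D$ as $s$ grows, which changes the totals.

```python
def block_weight_counts(C, D, s, r):
    """Weight counts for one block (biases and batch-norm parameters excluded)."""
    d = D // s
    return {
        "conv1x1_in": C * D,                # bottleneck projection C -> D
        "hierarchical_3x3": s * 9 * d * d,  # s separate 3x3 filters K_i on d channels each
        "conv1x1_out": D * C,               # fusing projection D -> C
        "se_mlp": 2 * C * (C // r),         # W1 (C/r x C) plus W2 (C x C/r)
    }


def plain_3x3_weights(D):
    """A single dense 3x3 convolution on D channels, for comparison."""
    return 9 * D * D


counts = block_weight_counts(C=64, D=64, s=4, r=16)
print(counts)
print("total:", sum(counts.values()))             # 17920
print("dense 3x3 on D:", plain_3x3_weights(64))   # 36864
```

With $D$ fixed, the $s$ hierarchical filters together hold $9D^2/s$ weights, a factor of $s$ fewer than one dense $3 \times 3$ convolution over all $D$ channels, while the SE MLP contributes $2C^2/r$ weights, so a larger $r$ shrinks the gate.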

A summarized pseudocode sketch is given in (Xue et al., 2021):

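def SE_Res2Net_Block(X, s, r):
    # r is the SE reduction ratio: W1 maps C -> C/r, W2 maps C/r -> C
    X0 = Conv1x1(X)                      # bottleneck projection to D channels
    Xs = Split(X0, s)                    # s channel groups of D/s channels each
    Y1 = Conv3x3_1(Xs[0])                # first group: y_1 = K_1(x_1)
    Ys = [Y1]
    for i in range(1, s):
        # y_i = K_i(x_i + y_{i-1}): add the previous path's output, then convolve
        Yi = Conv3x3_i(Xs[i] + Ys[i-1])
        Ys.append(Yi)
    U = Concat(Ys)                       # concatenate along the channel axis
    R = Conv1x1_fuse(U)                  # fuse back to C channels
    z = GlobalAvgPool(R)                 # squeeze: one descriptor per channel
    e = sigmoid(W2 @ ReLU(W1 @ z))       # excitation: per-channel gates in (0, 1)
    R_se = R * expand(e)                 # broadcast the gates over H and W
    return R_se + X                      # residual addition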
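The pseudocode can be turned into a runnable forward pass. The following is a minimal NumPy sketch under several simplifying assumptions rather than the authors' implementation: batch normalization, intermediate ReLUs, and biases are left out, weights are random, the input and output channel counts match so no residual projection is needed, and all function and parameter names (`conv1x1`, `conv3x3`, `init_params`, `se_res2net_block`) are illustrative.

```python
import numpy as np


def conv1x1(x, w):
    """Pointwise convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def conv3x3(x, w):
    """3x3 convolution, stride 1, zero padding 1. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for du in range(3):
        for dv in range(3):
            # Accumulate the contribution of each kernel tap over shifted views.
            out += np.einsum("oc,chw->ohw", w[:, :, du, dv], xp[:, du:du + h, dv:dv + wd])
    return out


def init_params(c, d_total, s, r, seed=0):
    """Random weights for illustration only (c = C, d_total = D)."""
    rng = np.random.default_rng(seed)
    d = d_total // s
    return {
        "w_in": rng.normal(0.0, 0.1, (d_total, c)),
        "k": [rng.normal(0.0, 0.1, (d, d, 3, 3)) for _ in range(s)],
        "w_out": rng.normal(0.0, 0.1, (c, d_total)),
        "w1": rng.normal(0.0, 0.1, (c // r, c)),
        "w2": rng.normal(0.0, 0.1, (c, c // r)),
    }


def se_res2net_block(x, p, s):
    """Forward pass for one sample x of shape (C, H, W)."""
    u = conv1x1(x, p["w_in"])                    # (D, H, W)
    xs = np.split(u, s, axis=0)                  # s groups of D/s channels
    ys = [conv3x3(xs[0], p["k"][0])]             # y_1 = K_1(x_1)
    for i in range(1, s):
        ys.append(conv3x3(xs[i] + ys[i - 1], p["k"][i]))   # y_i = K_i(x_i + y_{i-1})
    y = conv1x1(np.concatenate(ys, axis=0), p["w_out"])    # (C, H, W)
    z = y.mean(axis=(1, 2))                      # squeeze: (C,)
    a = np.maximum(p["w1"] @ z, 0.0)             # ReLU(W1 z)
    e = 1.0 / (1.0 + np.exp(-(p["w2"] @ a)))     # sigmoid(W2 a)
    return x + e[:, None, None] * y              # residual addition X + S


x = np.random.default_rng(1).normal(size=(16, 8, 8))
params = init_params(c=16, d_total=16, s=4, r=8)
print(se_res2net_block(x, params, s=4).shape)    # (16, 8, 8)
```

A deep learning framework would normally supply the convolutions, normalization layers, and batching; the explicit loops here are only meant to mirror the pseudocode step by step.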

6. Applications and Integration

The SE-Res2Net block is used as the main convolutional backbone in anti-spoofing systems, speech synthesis detection, and audio forensics. For example, in the SE-Res2Net-Conformer architecture, stacked SE-Res2Net blocks extract local and multi-scale time-frequency patterns from spectro-temporal features (e.g., CQT, log-mel), which are then passed to Conformer layers that capture long-range temporal dependencies (Wang et al., 2022). In multi-modal voice spoofing detection pipelines, SE-Res2Net acts as the physical (acoustic) feature extractor, integrated with other modules such as densely connected CNNs with SE for physiological signal processing (Xue et al., 2021).

The combination of diverse local receptive fields, explicit cross-channel modeling, and residual learning confers robustness to previously unseen spoofing artifacts and allows for the efficient deployment of high-performing models on resource-constrained platforms (Li et al., 2020, Xue et al., 2021, Wang et al., 2022).
