
Res2Net Multi-Scale Convolution Module

Updated 8 May 2026
  • Res2Net multi-scale convolution is a neural module that partitions feature channels into groups and applies hierarchical convolutions, capturing both fine and coarse patterns in a single block.
  • The design provides efficient multi-scale feature extraction with competitive parameter counts and minimal computational overhead, benefiting tasks such as image classification and time series analysis.
  • The gated extension, GRes2Net, introduces learnable gates to selectively modulate inter-group information flow, leading to improved performance in diverse applications.

Res2Net multi-scale convolution is a neural module that generalizes the conventional residual bottleneck by incorporating hierarchical, residual-like connections at a granular level within a single block. This design fundamentally augments the multi-scale representation capability of convolutional networks by enabling each block to capture features at multiple, fine-grained receptive fields. Res2Net has been shown to enhance performance across diverse tasks, including image classification, object detection, audio processing, and multivariate time series analysis, and can be further extended with gating mechanisms for improved control of information flow.

1. Res2Net Multi-Scale Convolution Module

The Res2Net block partitions the feature channels into multiple groups and links them through a sequence of residual-like, hierarchical convolutions. Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$ (or $X \in \mathbb{R}^{C \times L}$ for 1D data), the block proceeds as follows:

  1. Channel Expansion: Apply a $1 \times 1$ convolution to project the input to an intermediate feature map with $C' = s \times w$ channels, where $s$ is the number of scale groups and $w$ is the number of channels per group.
  2. Group Splitting: Divide the intermediate feature map into $s$ groups $\{x_1, x_2, \ldots, x_s\}$, each $x_i \in \mathbb{R}^{w \times H \times W}$.
  3. Hierarchical Convolutions: For each group $i$, compute the output as

$$
y_i =
\begin{cases}
x_i, & i = 1 \\
K_i(x_i), & i = 2 \\
K_i(x_i + y_{i-1}), & 2 < i \le s
\end{cases}
$$

where each $K_i$ is a learnable $3 \times 3$ convolution (or 1D convolution for time series).

  4. Fusion: Concatenate $y_1, \ldots, y_s$ along the channel axis and apply a final $1 \times 1$ convolution for channel fusion and compression.
  5. Residual Addition: Add the original input tensor as in standard ResNets.

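This module exposes each channel group to a different effective receptive field: the first group passes through unchanged, the second sees a $3 \times 3$ neighborhood of the expanded feature map, and each subsequent group sees a progressively larger one, allowing simultaneous extraction of both local and global patterns within a single block (Gao et al., 2019).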
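
The split, hierarchical convolution, and concatenation steps above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not the reference implementation: each group has a single channel ($w = 1$), the data are 1D, every group convolution is a fixed 3-tap all-ones filter rather than a learned one, and the surrounding $1 \times 1$ convolutions and the residual addition are omitted. The helper names (`conv3`, `res2net_hierarchy`, `support_width`) are hypothetical. Feeding an impulse into every group makes the growth of the receptive field visible as the width of each output's nonzero support.

```python
def conv3(signal, kernel):
    """Zero-padded ("same") 1D cross-correlation with a 3-tap kernel."""
    padded = [0.0] + list(signal) + [0.0]
    return [sum(k * padded[t + j] for j, k in enumerate(kernel))
            for t in range(len(signal))]


def res2net_hierarchy(groups, kernels):
    """y_1 = x_1; y_2 = K_2(x_2); y_i = K_i(x_i + y_{i-1}) for i > 2."""
    outputs = [list(groups[0])]              # first group is passed through
    for i in range(1, len(groups)):
        x = list(groups[i])
        if i > 1:                            # add the previous group's output
            x = [a + b for a, b in zip(x, outputs[-1])]
        outputs.append(conv3(x, kernels[i - 1]))
    return outputs


def support_width(signal):
    """Number of positions spanned by the nonzero entries."""
    nonzero = [t for t, v in enumerate(signal) if v != 0.0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0


# Illustrative setup (assumed values): s = 4 groups, 11 samples each.
length, s = 11, 4
impulse = [1.0 if t == length // 2 else 0.0 for t in range(length)]
groups = [list(impulse) for _ in range(s)]
kernels = [[1.0, 1.0, 1.0] for _ in range(s - 1)]   # stand-ins for K_2..K_s

outputs = res2net_hierarchy(groups, kernels)
widths = [support_width(y) for y in outputs]
fused = [v for y in outputs for v in y]             # channel-axis concatenation
print(widths)
```

Under these assumptions the supports span 1, 3, 5, and 7 samples for the four groups, which mirrors the progressively larger receptive fields described above.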

2. Theoretical Advantages and Parameter Efficiency

The granular multi-scale design of Res2Net introduces a “scale” dimension in addition to the traditional “depth,” “width,” and “cardinality.” When compared to standard bottleneck structures:

  • The module realizes multiple effective receptive fields within each block due to the cascaded arrangement of residual links.
  • Parameter count for the collection of $3 \times 3$ convolutions (ignoring biases) is

$$
(s - 1) \times 9 w^2,
$$

which is lower than the $9 (s w)^2$ parameters that the single-path $3 \times 3$ convolution of a standard ResNet bottleneck would need at the same width $s w$, for any $s > 1$ (Li et al., 2020).

  • FLOPs and memory overhead remain close to baseline ResNet architectures for typical configurations, and runtime increases are marginal on $224 \times 224$ images.

A key empirical finding is that increasing the scale $s$, rather than only the width or depth, yields superior performance gains for fixed computational budgets (Gao et al., 2019).
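
The parameter comparison in this section can be checked with a short calculation. The sketch below counts only convolution weights (no biases, no normalization layers) and assumes the block uses $s - 1$ group convolutions of size $3 \times 3$ with $w$ input and $w$ output channels each; the function names and the example values $s = 4$, $w = 26$ are illustrative assumptions.

```python
def res2net_conv_params(s, w, k=3):
    """Weights in the s - 1 group convolutions (each k x k, w -> w channels)."""
    return (s - 1) * k * k * w * w


def single_conv_params(channels, k=3):
    """Weights in one k x k convolution mapping `channels` -> `channels`."""
    return k * k * channels * channels


s, w = 4, 26                      # illustrative values, an assumption here
grouped = res2net_conv_params(s, w)
single = single_conv_params(s * w)
ratio = grouped / single          # equals (s - 1) / s**2
print(grouped, single, ratio)     # 18252 97344 0.1875
```

At equal total width the grouped convolutions need a fraction $(s - 1)/s^2$ of the weights of a single convolution, which is why practical configurations can widen the block and still stay near the baseline budget.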

3. Gated Res2Net Extension (GRes2Net)

The GRes2Net module extends the standard Res2Net block by integrating dynamic, learnable gates. For each residual connection ($y_{i-1}$ added to $x_i$), a gate $g_i$ is computed by a small learnable subnetwork; its exact form is given in Yang et al. (2020).

The modified hierarchical computation becomes:

$$
y_i =
\begin{cases}
x_i, & i = 1 \\
K_i(x_i), & i = 2 \\
K_i(x_i + g_i \odot y_{i-1}), & 2 < i \le s
\end{cases}
$$

where $\odot$ denotes elementwise multiplication. This enables explicit modeling of inter-channel correlations and selective modulation of cross-scale information transfer, mitigating the risk of propagating irrelevant or noisy signals across scales. Empirical results show consistent gains from gating in multivariate time series tasks (Yang et al., 2020).
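
The effect of gating the cross-group connection can be illustrated with a small pure-Python sketch. It is not the GRes2Net implementation: groups are single-channel 1D signals, the group convolutions are fixed 3-tap all-ones filters, and the gate functions are hypothetical stand-ins (in particular `toy_gate`, a sigmoid of the elementwise sum, is an assumption and not the gating subnetwork from Yang et al., 2020). A fully open gate reproduces the ungated hierarchy, while a fully closed gate cuts the flow between groups.

```python
import math


def conv3(signal, kernel):
    """Zero-padded ("same") 1D cross-correlation with a 3-tap kernel."""
    padded = [0.0] + list(signal) + [0.0]
    return [sum(k * padded[t + j] for j, k in enumerate(kernel))
            for t in range(len(signal))]


def gated_hierarchy(groups, kernels, gate_fn):
    """y_1 = x_1; y_2 = K_2(x_2); y_i = K_i(x_i + g_i * y_{i-1}) for i > 2."""
    outputs = [list(groups[0])]
    for i in range(1, len(groups)):
        x = list(groups[i])
        if i > 1:
            prev = outputs[-1]
            gate = gate_fn(x, prev)          # one gate value per position
            x = [a + g * b for a, g, b in zip(x, gate, prev)]
        outputs.append(conv3(x, kernels[i - 1]))
    return outputs


def open_gate(x, prev):
    return [1.0] * len(x)                    # passes everything: plain hierarchy


def closed_gate(x, prev):
    return [0.0] * len(x)                    # blocks all cross-group flow


def toy_gate(x, prev):
    """Hypothetical gate: sigmoid of the elementwise sum (illustration only)."""
    return [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(x, prev)]


def support_width(signal):
    nonzero = [t for t, v in enumerate(signal) if v != 0.0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0


length, s = 11, 4                            # illustrative sizes
impulse = [1.0 if t == length // 2 else 0.0 for t in range(length)]
groups = [list(impulse) for _ in range(s)]
kernels = [[1.0, 1.0, 1.0] for _ in range(s - 1)]

y_open = gated_hierarchy(groups, kernels, open_gate)
y_closed = gated_hierarchy(groups, kernels, closed_gate)
y_toy = gated_hierarchy(groups, kernels, toy_gate)
widths_open = [support_width(y) for y in y_open]
widths_closed = [support_width(y) for y in y_closed]
print(widths_open, widths_closed)
```

With the gate closed, every group beyond the first only sees its own 3-sample neighborhood; intermediate gate values, as produced by `toy_gate`, scale down how much of the previous group's output is carried forward.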

4. Instantiations and Integration in Deep Backbones

Res2Net blocks have been deployed in varied forms across computer vision, speech, and time series domains:

  • Vision Backbones: Used to replace the central $3 \times 3$ convolution in residual blocks of ResNet, ResNeXt (scale $s$ orthogonal to cardinality), and DLA architectures (Gao et al., 2019).
  • Speaker Verification: Employed in deep CNNs for text-independent speaker embedding extraction. Increasing the scale $s$ delivers significant improvements in equal error rate (EER), particularly for short utterances (Zhou et al., 2020).
  • Time Series Analysis: The GRes2Net structure forms the backbone for modeling both classification and forecasting tasks, with a multi-block stack, temporal pooling, and fully connected output layers (Yang et al., 2020).
  • Speech Anti-Spoofing: Applied for adaptive fusion of multi-scale features, demonstrating robustness to unseen synthetic and replayed speech attacks. Integration with other modules, such as Squeeze-and-Excitation (SE) blocks, further enhances performance (Li et al., 2020).

5. Empirical Evaluation and Comparative Results

Experiments on benchmark tasks confirm the multi-scale convolution’s practical advantages.

| Domain | Task / Dataset | Metric | ResNet | Res2Net | GRes2Net |
| --- | --- | --- | --- | --- | --- |
| Vision | ImageNet classification | Top-1 error (%) | 23.85 | 22.01 | n/a |
| Speaker verification | VoxCeleb1-test (2 s) | EER (%) | 6.77 | 5.58 | n/a |
| Time series | EGG (classification) | Accuracy (%) | 91.50 | 91.50 | 92.76 |
| Time series | Appliance forecasting | RMSE / MAE / R² | 13.98 / 7.64 / 0.97 | 13.98 / 7.64 / 0.97 | 12.84 / 6.99 / 0.98 |

On vision tasks, Res2Net-50 reduces top-1 error from 23.85% (ResNet) to 22.01% on ImageNet (Gao et al., 2019). On speech, Res2Net achieves 17.6% relative EER reduction on short VoxCeleb1 trials (Zhou et al., 2020). GRes2Net further improves classification/forecasting results over vanilla Res2Net in multivariate time series scenarios (Yang et al., 2020).
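
The relative figures quoted above follow directly from the tabulated values. The short calculation below reproduces the 17.6% relative EER reduction and, as derived numbers that are not themselves reported in the source, the corresponding relative reductions for ImageNet top-1 error and for appliance-forecasting RMSE (Res2Net to GRes2Net). The function name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error-type metric relative to the baseline."""
    return (baseline - improved) / baseline


eer = relative_reduction(6.77, 5.58)       # VoxCeleb1-test (2 s), ResNet -> Res2Net
top1 = relative_reduction(23.85, 22.01)    # ImageNet top-1 error, ResNet -> Res2Net
rmse = relative_reduction(13.98, 12.84)    # appliance RMSE, Res2Net -> GRes2Net

print(f"{eer:.1%} {top1:.1%} {rmse:.1%}")  # 17.6% 7.7% 8.2%
```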

6. Context, Extensions, and Significance

Res2Net’s hierarchical multi-scale design distinguishes itself from previous efforts by embedding diverse receptive fields within each block, rather than only at different layers. This enables the block to represent fine- and coarse-grained context jointly and adaptively, a property especially beneficial when variable scale or context information is crucial—such as small- and large-object detection, or time-series with multiple temporal dependencies.

The GRes2Net extension introduces learned gates to dynamically control intra-block information flow, capturing complex inter-channel dependencies. This approach is particularly effective in domains with substantial channel-wise or temporal correlations.

Integration with established modules (e.g., Squeeze-and-Excitation), orthogonal scale and cardinality control in Res2NeXt, and consistent cross-domain performance gains underscore the versatility and significance of the Res2Net multi-scale convolutional paradigm (Gao et al., 2019, Yang et al., 2020, Zhou et al., 2020, Li et al., 2020).

7. Notation and Implementation Summary

Key notation for the Res2Net (and GRes2Net) module:

| Symbol | Meaning |
| --- | --- |
| $X$ | Input feature map ($X \in \mathbb{R}^{C \times H \times W}$ or $X \in \mathbb{R}^{C \times L}$) |
| $s$ | Number of channel groups (scales) |
| $w$ | Channels per group; $C' = s \times w$ |
| $x_i$ | $i$th group ($x_i \in \mathbb{R}^{w \times H \times W}$, $i = 1, \ldots, s$) |
| $K_i$ | $3 \times 3$ conv per group |
| $y_i$ | Output of scale $i$ |
| $g_i$ | Gate (GRes2Net) for inter-group connection |

Each block can be reconstructed using the split–hierarchical–concat–compress flow and, if gated, with the dynamic computation of $g_i$ as described above (Gao et al., 2019, Yang et al., 2020).
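
As a compact companion to the notation, the following sketch traces tensor shapes through the split, hierarchical, concat, and compress flow. It assumes stride 1 with "same" padding throughout and assumes the final $1 \times 1$ convolution maps back to $C$ channels so that the identity shortcut can be added directly; the function name and the example sizes are hypothetical.

```python
def trace_block_shapes(c, s, w, height, width):
    """Return (stage, shape) pairs for one block on a C x H x W input."""
    expanded = s * w                                  # C' = s * w
    group = (w, height, width)
    return [
        ("input", (c, height, width)),
        ("1x1 expansion", (expanded, height, width)),
        ("split", [group] * s),                       # x_1 .. x_s
        ("hierarchical convs", [group] * s),          # y_1 .. y_s
        ("concat", (expanded, height, width)),
        ("1x1 fusion", (c, height, width)),           # assumed: back to C channels
        ("residual add", (c, height, width)),
    ]


# Illustrative sizes (assumptions): C = 256, s = 4, w = 26, 56 x 56 feature map.
stages = trace_block_shapes(256, 4, 26, 56, 56)
for name, shape in stages:
    print(f"{name:>20}: {shape}")
```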
