Res2Net Multi-Scale Convolution Module

Updated 8 May 2026

Res2Net multi-scale convolution is a neural module that partitions feature channels into groups and applies hierarchical convolutions, capturing both fine and coarse patterns in a single block.
The design provides efficient multi-scale feature extraction with competitive parameter counts and minimal computational overhead, benefiting tasks such as image classification and time series analysis.
The gated extension, GRes2Net, introduces learnable gates to selectively modulate inter-group information flow, leading to improved performance in diverse applications.

Res2Net multi-scale convolution is a neural module that generalizes the conventional residual bottleneck by incorporating hierarchical, residual-like connections at a granular level within a single block. This design fundamentally augments the multi-scale representation capability of convolutional networks by enabling each block to capture features at multiple, fine-grained receptive fields. Res2Net has been shown to enhance performance across diverse tasks, including image classification, object detection, audio processing, and multivariate time series analysis, and can be further extended with gating mechanisms for improved control of information flow.

1. Res2Net Multi-Scale Convolution Module

The Res2Net block partitions the feature channels into multiple groups and links these via a sequence of residual-like, hierarchical convolutions. Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$ (or $X \in \mathbb{R}^{C \times L}$ for 1D data), the process includes:

Channel Expansion: Apply a $1 \times 1$ convolution to project the input to an intermediate feature map with $C' = s \times w$ channels, where $s$ is the number of scale groups and $w$ is the number of channels per group.
Group Splitting: Divide the intermediate feature map into $s$ groups $\{x_1, x_2, \ldots, x_s\}$, each $x_i \in \mathbb{R}^{w \times H \times W}$.
Hierarchical Convolutions: For each group $i$, compute the output as

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$

where each $K_i$ is a learnable $3 \times 3$ convolution (or 1D convolution for time series).

Fusion: Concatenate $\{y_1, y_2, \ldots, y_s\}$ along the channel axis and apply a final $1 \times 1$ convolution for channel fusion and compression.
Residual Addition: Add the original input tensor as in standard ResNets.

This module exposes each channel group to a different effective receptive field: $y_1$ passes through unchanged, $y_2$ sees $3 \times 3$, and each further $y_i$ can see up to $(2i-1) \times (2i-1)$, allowing simultaneous extraction of both local and global patterns within a single block (Gao et al., 2019).
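The split and hierarchical steps can be sketched in a few lines of plain Python for the 1D case. This is a minimal illustration, not a reference implementation: it assumes the recurrence $y_1 = x_1$, $y_2 = K_2(x_2)$, $y_i = K_i(x_i + y_{i-1})$ from Gao et al. (2019), omits the $1 \times 1$ convolutions and the residual addition, and the names `conv3_same`, `add`, and `res2net_core` are hypothetical helpers. Feeding an impulse through all-ones kernels makes the growing receptive field of each group visible.

```python
def conv3_same(x, weight):
    """Width-3, zero-padded 1D convolution.

    x:      list of input channels, each a list of L floats
    weight: weight[o][c][k] = tap k of the filter from input channel c
            to output channel o
    """
    length = len(x[0])
    out = []
    for filters in weight:                 # one entry per output channel
        row = []
        for t in range(length):
            acc = 0.0
            for c, taps in enumerate(filters):
                for k, tap in enumerate(taps):
                    j = t + k - 1          # centre the 3 taps on position t
                    if 0 <= j < length:
                        acc += tap * x[c][j]
            row.append(acc)
        out.append(row)
    return out


def add(a, b):
    """Elementwise sum of two equally shaped channel lists."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def res2net_core(x, kernels, s):
    """Split x (s*w channels) into s groups and chain them hierarchically.

    kernels holds K_2 ... K_s (s-1 entries); group 1 has no convolution.
    """
    w = len(x) // s
    groups = [x[i * w:(i + 1) * w] for i in range(s)]
    ys = [groups[0]]                                   # y_1 = x_1
    for i in range(1, s):
        inp = groups[i] if i == 1 else add(groups[i], ys[-1])
        ys.append(conv3_same(inp, kernels[i - 1]))     # y_i = K_i(...)
    return [ch for y in ys for ch in y]                # concatenate along channels


# Impulse input: every group holds a single 1.0 at the centre position.
s, w, L = 4, 1, 11
x = [[1.0 if t == 5 else 0.0 for t in range(L)] for _ in range(s * w)]
kernels = [[[[1.0, 1.0, 1.0]]] for _ in range(s - 1)]  # all-ones taps, w = 1
y = res2net_core(x, kernels, s)
support = [sum(v != 0.0 for v in ch) for ch in y]
print(support)  # [1, 3, 5, 7]
```

The number of non-zero positions per output group grows as $2i - 1$, the 1D analogue of the widening receptive field across groups.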

2. Theoretical Advantages and Parameter Efficiency

The granular multi-scale design of Res2Net introduces a “scale” dimension in addition to the traditional “depth,” “width,” and “cardinality.” When compared to standard bottleneck structures:

The module realizes multiple effective receptive fields within each block due to the cascaded arrangement of residual links.
Parameter count for the collection of $3 \times 3$ convolutions is

$$(s - 1) \times 9 w^2$$

which, ignoring biases, is lower than the $9 (s w)^2$ weights of a single $3 \times 3$ convolution over all $s \times w$ channels (the single-path design of a standard ResNet bottleneck at equal width) whenever $s > 1$ (Li et al., 2020).

FLOPs and memory overhead remain close to baseline ResNet architectures for typical configurations, and runtime increases on $224 \times 224$ images are marginal.

A key empirical finding is that increasing the scale $s$, rather than only the width or depth, yields superior performance gains for fixed computational budgets (Gao et al., 2019).
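The parameter comparison can be checked with simple arithmetic. The sketch below counts only convolution weights (no biases or normalization) and compares the $s - 1$ group convolutions, each mapping $w$ to $w$ channels, against one $3 \times 3$ convolution over the same total width $s \times w$. The function names and the values $s = 4$, $w = 26$ are illustrative assumptions, not taken from the text above.

```python
def res2net_3x3_params(s, w, k=3):
    """Weights in the s-1 group convolutions, each mapping w -> w channels."""
    return (s - 1) * k * k * w * w


def single_conv_params(channels, k=3):
    """Weights in one k x k convolution mapping channels -> channels."""
    return k * k * channels * channels


s, w = 4, 26                         # illustrative values (assumption)
split = res2net_3x3_params(s, w)
dense = single_conv_params(s * w)
print(split, dense, split / dense)   # 18252 97344 0.1875
```

At equal total width the ratio is $(s - 1) / s^2$, so the saving grows as the scale $s$ increases.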

3. Gated Res2Net Extension (GRes2Net)

The GRes2Net module extends the standard Res2Net block by integrating dynamic, learnable gates. For each residual connection ($y_{i-1}$ added to $x_i$), a gate $g_i$ is computed via a small subnetwork $G_i$:

$$g_i = G_i(x_i, y_{i-1})$$

The modified hierarchical computation becomes:

$$y_i = K_i(x_i + g_i \odot y_{i-1})$$

for every group $i$ that receives the previous output $y_{i-1}$, where $\odot$ denotes elementwise multiplication. This enables explicit modeling of inter-channel correlations and selective modulation of cross-scale information transfer, mitigating the risk of propagating irrelevant or noisy signals across scales. Empirical results show consistent gains from gating in multivariate time series tasks (Yang et al., 2020).
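The effect of gating a cross-scale connection can be illustrated with a toy 1D example. This is only a sketch under stated assumptions: the gate here is a hypothetical elementwise sigmoid of an affine function of $x_i$ and $y_{i-1}$ (the actual gate subnetwork in Yang et al., 2020 may differ), each group has a single channel, and gating is applied from the third group onward, mirroring the ungated Res2Net recurrence. The names `conv3`, `gate`, and `gres2net_core` are illustrative.

```python
import math


def conv3(x, taps):
    """Width-3, zero-padded 1D convolution on a single channel."""
    n = len(x)
    return [
        sum(taps[k] * x[t + k - 1] for k in range(3) if 0 <= t + k - 1 < n)
        for t in range(n)
    ]


def gate(x_i, y_prev, a, b, c):
    """Hypothetical gate: elementwise sigmoid(a * x_i + b * y_prev + c)."""
    return [1.0 / (1.0 + math.exp(-(a * u + b * v + c)))
            for u, v in zip(x_i, y_prev)]


def gres2net_core(groups, kernels, gate_params):
    """groups: x_1..x_s; kernels: K_2..K_s; gate_params: (a, b, c) for i = 3..s."""
    ys = [groups[0], conv3(groups[1], kernels[0])]    # y_1 = x_1, y_2 = K_2(x_2)
    for i in range(2, len(groups)):
        g = gate(groups[i], ys[-1], *gate_params[i - 2])
        mixed = [u + gi * v for u, gi, v in zip(groups[i], g, ys[-1])]
        ys.append(conv3(mixed, kernels[i - 1]))       # y_i = K_i(x_i + g_i * y_{i-1})
    return ys


impulse = [1.0 if t == 4 else 0.0 for t in range(9)]
groups = [impulse[:] for _ in range(3)]
kernels = [[1.0, 1.0, 1.0]] * 2
open_y = gres2net_core(groups, kernels, [(0.0, 0.0, 50.0)])     # gate close to 1
closed_y = gres2net_core(groups, kernels, [(0.0, 0.0, -50.0)])  # gate close to 0
print([round(v, 6) for v in open_y[2]])    # [0.0, 0.0, 1.0, 3.0, 4.0, 3.0, 1.0, 0.0, 0.0]
print([round(v, 6) for v in closed_y[2]])  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

With the gate open, the third group inherits the wider context of the second; with the gate closed, it falls back to a plain convolution of its own input, which is the selective modulation the gating is meant to provide.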

4. Instantiations and Integration in Deep Backbones

Res2Net blocks have been deployed in varied forms across computer vision, speech, and time series domains:

Vision Backbones: Used to replace the central $3 \times 3$ convolution in residual blocks of ResNet, ResNeXt (scale $s$ orthogonal to cardinality), and DLA architectures (Gao et al., 2019).
Speaker Verification: Employed in deep CNNs for text-independent speaker embedding extraction. Increasing the scale $s$ delivers significant improvements in equal error rate (EER), particularly for short utterances (Zhou et al., 2020).
Time Series Analysis: The GRes2Net structure forms the backbone for modeling both classification and forecasting tasks, with a multi-block stack, temporal pooling, and fully connected output layers (Yang et al., 2020).
Speech Anti-Spoofing: Applied for adaptive fusion of multi-scale features, demonstrating robustness to unseen synthetic and replayed speech attacks. Integration with other modules, such as Squeeze-and-Excitation (SE) blocks, further enhances performance (Li et al., 2020).

5. Empirical Evaluation and Comparative Results

Experiments on benchmark tasks confirm the multi-scale convolution’s practical advantages.

Domain	Task / Dataset	Metric	ResNet	Res2Net	GRes2Net
Vision	ImageNet Classification	Top-1 error (%)	23.85	22.01	–
Speaker Verification	VoxCeleb1-test (2 s)	EER (%)	6.77	5.58	–
Time Series	EGG (classification)	Accuracy (%)	91.50	91.50	92.76
Time Series	Appliance Forecasting	RMSE / MAE / R²	13.98 / 7.64 / 0.97	13.98 / 7.64 / 0.97	12.84 / 6.99 / 0.98

On vision tasks, Res2Net-50 reduces top-1 error on ImageNet from 23.85% (ResNet) to 22.01% (Gao et al., 2019). On speech, Res2Net achieves a 17.6% relative EER reduction on short (2 s) VoxCeleb1 trials (Zhou et al., 2020). GRes2Net further improves classification and forecasting results over vanilla Res2Net in multivariate time series scenarios (Yang et al., 2020).
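The relative figures follow directly from the table entries. A small check, using a hypothetical helper `relative_reduction` and only the numbers reported above:

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of an error-type metric relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline


eer = relative_reduction(6.77, 5.58)      # VoxCeleb1-test (2 s), ResNet -> Res2Net
top1 = relative_reduction(23.85, 22.01)   # ImageNet top-1 error, ResNet -> Res2Net
rmse = relative_reduction(13.98, 12.84)   # Appliance forecasting, Res2Net -> GRes2Net
print(round(eer, 1), round(top1, 1), round(rmse, 1))  # 17.6 7.7 8.2
```

The 17.6% relative EER reduction quoted in the text matches the absolute drop from 6.77% to 5.58%.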

6. Context, Extensions, and Significance

Res2Net’s hierarchical multi-scale design distinguishes itself from previous efforts by embedding diverse receptive fields within each block, rather than only at different layers. This enables the block to represent fine- and coarse-grained context jointly and adaptively, a property especially beneficial when variable scale or context information is crucial—such as small- and large-object detection, or time-series with multiple temporal dependencies.

The GRes2Net extension introduces learned gates to dynamically control intra-block information flow, capturing complex inter-channel dependencies. This approach is particularly effective in domains with substantial channel-wise or temporal correlations.

Integration with established modules (e.g., Squeeze-and-Excitation), orthogonal scale and cardinality control in Res2NeXt, and consistent cross-domain performance gains underscore the versatility and significance of the Res2Net multi-scale convolutional paradigm (Gao et al., 2019, Yang et al., 2020, Zhou et al., 2020, Li et al., 2020).

7. Notation and Implementation Summary

Key notation for the Res2Net (and GRes2Net) module:

Symbol	Meaning
$X$	Input feature map ($X \in \mathbb{R}^{C \times H \times W}$ or $X \in \mathbb{R}^{C \times L}$)
$s$	Number of channel groups (scales)
$w$	Channels per group; $C' = s \times w$
$x_i$	$i$th group ($i = 1, \ldots, s$, $x_i \in \mathbb{R}^{w \times H \times W}$)
$K_i$	$3 \times 3$ conv per group
$y_i$	Output of scale $i$
$g_i$	Gate (GRes2Net) for inter-group connection

Each block can be reconstructed using the split–hierarchical–concat–compress flow and, if gated, with the dynamic computation of $g_i$ as described above (Gao et al., 2019, Yang et al., 2020).
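As a bookkeeping aid, the tensor shapes along the split–hierarchical–concat–compress flow can be traced with a small helper. This is a sketch under two assumptions that the text does not state explicitly: the final $1 \times 1$ convolution maps back to the input channel count $C$ so that the residual addition is shape-compatible, and the $3 \times 3$ convolutions preserve spatial size. The function name `res2net_shapes` and the example sizes are hypothetical.

```python
def res2net_shapes(C, H, W, s, w):
    """Return (stage, shape) pairs for one Res2Net block on a C x H x W input."""
    c_mid = s * w                                   # C' = s * w
    return [
        ("input X", (C, H, W)),
        ("after 1x1 expansion", (c_mid, H, W)),
        ("each group x_i (s of them)", (w, H, W)),
        ("each output y_i (s of them)", (w, H, W)),
        ("after concatenation", (c_mid, H, W)),
        ("after 1x1 fusion", (C, H, W)),            # assumed to match X
        ("after residual addition", (C, H, W)),
    ]


shapes = res2net_shapes(C=256, H=56, W=56, s=4, w=26)
for stage, shape in shapes:
    print(f"{stage:30s} {shape}")
```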


References

Gao et al. (2019). Res2Net: A New Multi-scale Backbone Architecture.

Li et al. (2020). Replay and Synthetic Speech Detection with Res2net Architecture.

Yang et al. (2020). Gated Res2Net for Multivariate Time Series Analysis.

Zhou et al. (2020). ResNeXt and Res2Net Structures for Speaker Verification.
