1D Res2Net Modules for Sequence Modeling
- 1-Dimensional Res2Net modules extend ResNet bottlenecks by splitting channels into multiple scales to capture both fine and coarse temporal features.
- Gated variants such as CG-Res2Net and GRes2Net dynamically modulate inter-channel information flow, enhancing multi-scale feature representation.
- This architecture improves efficiency and accuracy in sequence modeling tasks like synthetic speech detection and time series analysis.
The one-dimensional (1D) Res2Net module is an architectural extension of the ResNet bottleneck block designed specifically for sequence modeling tasks, such as time series analysis and 1D signal processing. It generalizes multi-scale representation learning to the temporal or sequential domain by constructing hierarchical residual-like connections across distinct channel groups within each block. This design supports flexible receptive fields and encourages both fine- and coarse-scale feature interactions, offering advantages for complex time-dependent learning scenarios.
1. Architectural Principles of the 1D Res2Net Module
The 1D Res2Net module modifies the standard ResNet bottleneck by splitting the expanded feature channels into parallel groups, termed scales (Li et al., 2021, Yang et al., 2020). Given an input $X \in \mathbb{R}^{C \times T}$, where $C$ is the number of channels and $T$ is the temporal length, a 1×1 convolution expands $X$ to $U \in \mathbb{R}^{sw \times T}$, where $s$ is the number of scales and $w$ is the per-group channel width. The expanded tensor is then split into groups $x_1, \dots, x_s$, each $x_i \in \mathbb{R}^{w \times T}$.
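To make the channel arithmetic concrete, here is a minimal pure-Python sketch of the split; the helper name and shapes are illustrative, not taken from the cited papers:

```python
# Channel bookkeeping for the Res2Net split: a 1x1 conv expands the C
# input channels to s * w channels, which are then cut into s groups of
# w channels each (shapes are (channels, time); batch dimension omitted).
def split_channels(num_channels: int, scales: int) -> list:
    """Return the channel index range owned by each of the `scales` groups."""
    assert num_channels % scales == 0, "expanded width must divide evenly"
    w = num_channels // scales
    return [range(i * w, (i + 1) * w) for i in range(scales)]

s, w = 4, 16                       # scales s and per-group width w
groups = split_channels(s * w, s)  # 64 expanded channels -> 4 groups of 16
```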
The inner multi-scale processing is defined recursively:
- $y_1 = x_1$
- $y_2 = K_2(x_2)$
- $y_i = K_i(x_i + y_{i-1})$, for $2 < i \le s$,

where $K_i$ denotes a 1D convolution on group $i$ (typically with kernel size 3 and padding 1). The outputs $y_1, \dots, y_s$ are concatenated along the channel axis, compressed via a second 1×1 convolution, and combined with the residual connection from $X$, followed by batch normalization and ReLU:

$Y = \mathrm{ReLU}\big(\mathrm{BN}\big(X + \mathrm{Conv}_{1 \times 1}([y_1; \dots; y_s])\big)\big)$ (Li et al., 2021, Yang et al., 2020).
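The recursion can be traced end-to-end with a small framework-free sketch; the fixed smoothing kernel below is a toy stand-in for a learned per-group convolution, and each group is reduced to a single channel for readability:

```python
def conv1d_k3(x, kernel=(0.25, 0.5, 0.25)):
    """Toy 1D convolution: kernel size 3, zero padding 1 (stand-in for K_i)."""
    padded = [0.0] + list(x) + [0.0]
    return [sum(k * padded[t + j] for j, k in enumerate(kernel))
            for t in range(len(x))]

def res2net_multiscale(groups):
    """Hierarchical Res2Net recursion: y1 = x1, y2 = K2(x2),
    y_i = K_i(x_i + y_{i-1}) for i > 2."""
    y = [groups[0]]                    # y1 = x1 (identity branch)
    y.append(conv1d_k3(groups[1]))     # y2 = K2(x2)
    for xi in groups[2:]:
        fused = [a + b for a, b in zip(xi, y[-1])]
        y.append(conv1d_k3(fused))     # y_i = K_i(x_i + y_{i-1})
    return y                           # concatenated along channels in practice

# four single-channel groups over T = 4 time steps (an impulse signal)
groups = [[1.0, 0.0, 0.0, 0.0]] * 4
outputs = res2net_multiscale(groups)
```

Note how later groups see the impulse spread over more time steps: each recursion step applies another kernel-size-3 convolution, so the effective receptive field grows with the group index.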
2. Gated Extensions and Channel-wise Control
To enhance the ability of the block to selectively propagate information across groups and better handle inter-channel correlations, gated variants have been proposed.
- CG-Res2Net (Li et al., 2021) introduces a channel-wise gate $g_i$ to modulate information flow from $y_{i-1}$ to $x_i$. The residual-like connection is modified as:

$y_i = K_i(x_i + g_i \odot y_{i-1})$, where $\odot$ denotes channel-wise multiplication.

The gate $g_i$ is computed via mechanisms such as global average pooling of the feature maps, followed by fully connected layers and a sigmoid activation.
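A channel-wise gate of this kind can be sketched in plain Python; the single linear layer here is an illustrative stand-in for the fully connected stack:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_gate(feature_map, weights, bias):
    """Channel-wise gate in the spirit of CG-Res2Net: global average pooling
    over time per channel, a linear layer, then a sigmoid squashing to (0, 1).
    `feature_map` is a list of channels, each a list of T values; the single
    linear layer (weights, bias) stands in for the learned FC layers."""
    pooled = [sum(ch) / len(ch) for ch in feature_map]           # GAP over time
    return [sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(weights, bias)]                    # one gate per channel

fmap = [[1.0, 3.0], [2.0, 2.0]]    # 2 channels, T = 2 time steps
gates = channel_gate(fmap, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Because the gate is pooled over time, each channel receives a single scalar weight in (0, 1), so the same suppression or emphasis is applied at every time step of that channel.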
- Gated Res2Net (GRes2Net) (Yang et al., 2020) further generalizes gating, making the gate $G_i \in \mathbb{R}^{w \times T}$ (per batch, width, and length) dependent on the current input $x_i$, the previous output $y_{i-1}$, and the block's original input $X$:

$G_i = \tanh\big(F_1([X; x_i]) + F_2(y_{i-1})\big),$

where $F_1$ and $F_2$ are 1×1 conv + BatchNorm + ReLU modules. The fusion at each hierarchical step is:

$y_i = K_i(x_i + G_i \odot y_{i-1}).$
This formulation allows the model to learn dynamic, per-element modulation for each hierarchical connection, enhancing multi-scale temporal feature learning.
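The per-element fusion can be illustrated with a framework-free sketch in which the gate values are fixed; in the actual block they would be produced by the learned 1×1 conv gating modules:

```python
import math

def gres2net_fusion(xi, y_prev, gate):
    """GRes2Net-style fusion: each element of the previous group's output is
    scaled by its own gate value before being added to the current group.
    In the real block the gate comes from 1x1 conv + BatchNorm submodules."""
    return [x + g * y for x, g, y in zip(xi, gate, y_prev)]

xi     = [1.0, 1.0, 1.0]                                    # current group
y_prev = [2.0, 2.0, 2.0]                                    # previous output
gate   = [math.tanh(0.0), math.tanh(1.0), math.tanh(-1.0)]  # per-element gate
fused  = gres2net_fusion(xi, y_prev, gate)                  # then K_i(fused)
```

Unlike the channel-wise CG gate, each (channel, time) position gets its own modulation value, and a tanh gate can also flip the sign of the contribution from the previous scale.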
The modules may use different gating computation strategies, such as single-group and multi-group gates, with or without latent space projections, as detailed in (Li et al., 2021).
3. Implementation Details and Hyperparameterization
The main architectural parameters of 1D Res2Net and its gated variants are:
| Parameter | Typical Values | Role |
|---|---|---|
| Scales ($s$) | 4 | Number of parallel groups (feature splits) in each block |
| Channels per group ($w$) | 12–32 | Number of channels per group after expansion |
| Conv kernel size | 3 | Kernel size for the per-group 1D convolutions |
| Gating type | None / CG / G | Plain (vanilla), channel-gated (CG-Res2Net), or fully gated (GRes2Net) |
| Gate activation | Sigmoid / tanh | Controls the range of the channel-wise or per-element gates |
| Channel expansion | 1×1 conv | Expands from $C$ to $sw$ channels as input to the multi-scale block |
| Channel compression | 1×1 conv | Reduces from $sw$ channels to the desired output width after multi-scale fusion |
The 1D GRes2Net block outlined in (Yang et al., 2020) emphasizes that all convolutions (1×1, 3×1) are followed by BatchNorm and ReLU, with gating submodules kept lightweight to minimize parameter overhead.
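These hyperparameters can be collected into a single configuration object; the class and field names below are illustrative, not taken from the cited implementations:

```python
from dataclasses import dataclass

@dataclass
class Res2Net1dConfig:
    """Hyperparameters of a 1D Res2Net block, mirroring the table above."""
    scales: int = 4                   # number of parallel groups s
    width: int = 16                   # channels per group w after expansion
    kernel_size: int = 3              # per-group 1D conv kernel size
    gating: str = "none"              # "none", "cg" (channel gate), or "g" (GRes2Net)
    gate_activation: str = "sigmoid"  # "sigmoid" or "tanh"

    @property
    def expanded_channels(self) -> int:
        """Channel count after the 1x1 expansion conv (s * w)."""
        return self.scales * self.width

cfg = Res2Net1dConfig(scales=4, width=16, gating="cg")
```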
4. Pseudocode Representation
Representative PyTorch-style pseudocode for a generic 1D gated Res2Net block is provided in (Yang et al., 2020, Li et al., 2021). The forward pass, with the gate computation abstracted into a helper for brevity, follows this pattern:
```python
U = Conv1x1_expand(X)
U = BatchNorm(U); U = ReLU(U)
chunks = split_channels(U, groups=s)
y = []
for i, xi in enumerate(chunks):
    if i == 0:
        yi = xi
    elif i == 1:
        yi = K2_conv(xi); yi = BatchNorm(yi); yi = ReLU(yi)
    else:
        g = compute_gate(X, y[i-1], xi)  # e.g., via 1x1 convs + tanh
        fused = xi + g * y[i-1]
        yi = Ki_conv(fused); yi = BatchNorm(yi); yi = ReLU(yi)
    y.append(yi)
Y_concat = concatenate(y)
Y = Conv1x1_compress(Y_concat)
Y = BatchNorm(Y); Y = ReLU(Y)
return Y
```
5. Comparative Properties and Significance
The 1D Res2Net and its gated variants introduce flexible hierarchical receptive fields within a single convolutional block, as opposed to stacking deeper layers for multi-scale aggregation. This enables:
- Simultaneous local and broad temporal receptive fields within each block.
- Dynamic, learned weighting of information passed across scales (gated variants).
- Improved efficiency for sequence modeling, as the multi-scale hierarchy within a block replaces the need for deeper stacks or auxiliary context modules.
Empirically, such architectures have demonstrated consistent gains over vanilla ResNet-style 1D CNNs for tasks including synthetic speech detection (Li et al., 2021) and multivariate time series classification/forecasting (Yang et al., 2020). Gating mechanisms provide further accuracy improvements by suppressing irrelevant information and promoting robust multi-scale dependency learning.
6. Applications and Impact
1D Res2Net modules are employed in deep learning models targeting sequential and temporal domains, notably:
- Synthetic speech artifact detection systems, where they improve generalization to unseen spoofing attacks via flexible receptive field adaptation and channel-wise selection (Li et al., 2021).
- Multivariate time series analysis, both for classification and forecasting, where hierarchical gating yields state-of-the-art performance, including more accurate temporal feature extraction and correlation modeling (Yang et al., 2020).
A plausible implication is that the gating-enhanced Res2Net structures could be particularly advantageous in domains where input variables exhibit time-dependent and context-sensitive importance, as in sensor fusion, audio, and biomedical signal processing.
7. Relation to Original Res2Net and Extensions
The one-dimensional Res2Net block can be viewed as a direct analogue to the original Res2Net architecture proposed for 2D image processing (Gao et al., CVPR 2020), but adapted to operate exclusively along the temporal or sequence axis. Unlike the image domain, 1D applications often benefit from finer gating granularity, as temporal and inter-channel dependencies are more variable and task-specific.
A notable distinction is the proliferation of gating mechanisms in 1D variants (CG-Res2Net, GRes2Net), reflecting the increased utility of per-channel or per-step selection in sequential modeling. These gating innovations have not only extended the expressive power but have empirically demonstrated value on standard sequence learning benchmarks (Li et al., 2021, Yang et al., 2020).